Abstract. Statistical modeling is a powerful tool for developing and
testing theories by way of causal explanation, prediction, and descrip- 
tion. In many disciplines there is near-exclusive use of statistical mod- 
eling for causal explanation and the assumption that models with high 
explanatory power are inherently of high predictive power. Conflation 
between explanation and prediction is common, yet the distinction 
must be understood for progressing scientific knowledge. While this 
distinction has been recognized in the philosophy of science, the statis- 
tical literature lacks a thorough discussion of the many differences that 
arise in the process of modeling for an explanatory versus a predictive 
goal. The purpose of this article is to clarify the distinction between 
explanatory and predictive modeling, to discuss its sources, and to re- 
veal the practical implications of the distinction to each step in the 
modeling process. 

Key words and phrases: Explanatory modeling, causality, predictive 
modeling, predictive power, statistical strategy, data mining, scientific 
research.

1. INTRODUCTION

Looking at how statistical models are used in dif- 
ferent scientific disciplines for the purpose of the- 
ory building and testing, one finds a range of per- 
ceptions regarding the relationship between causal 
explanation and empirical prediction. In many sci- 
entific fields such as economics, psychology, educa- 
tion, and environmental science, statistical models 
are used almost exclusively for causal explanation, 
and models that possess high explanatory power 
are often assumed to inherently possess predictive 
power. In fields such as natural language processing 
and bioinformatics, the focus is on empirical prediction
with only a slight and indirect relation to causal
explanation. And yet in other research fields, such
as epidemiology, the emphasis on causal explanation 
versus empirical prediction is more mixed. Statisti- 
cal modeling for description, where the purpose is 
to capture the data structure parsimoniously, and 
which is the most commonly developed within the 
field of statistics, is not commonly used for theory 
building and testing in other disciplines. Hence, in 
this article I focus on the use of statistical mod- 
eling for causal explanation and for prediction. My 
main premise is that the two are often conflated, yet 
the causal versus predictive distinction has a large 
impact on each step of the statistical modeling pro- 
cess and on its consequences. Although not explic- 
itly stated in the statistics methodology literature, 
applied statisticians instinctively sense that predict- 
ing and explaining are different. This article aims to 
fill a critical void: to tackle the distinction between 
explanatory modeling and predictive modeling. 

Clearing the current ambiguity between the two is 
critical not only for proper statistical modeling, but 
more importantly, for proper scientific usage. Both
explanation and prediction are necessary for generating
and testing theories, yet each plays a different
role in doing so. The lack of a clear distinction
within statistics has created a lack of understanding
in many disciplines of the difference between
building sound explanatory models versus creating 
powerful predictive models, as well as confusing ex- 
planatory power with predictive power. The impli- 
cations of this omission and the lack of clear guide- 
lines on how to model for explanatory versus pre- 
dictive goals are considerable for both scientific re- 
search and practice and have also contributed to the 
gap between academia and practice. 

I start by defining what I term explaining and 
predicting. These definitions are chosen to reflect 
the distinct scientific goals that they are aimed at: 
causal explanation and empirical prediction, respec- 
tively. Explanatory modeling and predictive model- 
ing reflect the process of using data and statistical 
(or data mining) methods for explaining or predict- 
ing, respectively. The term modeling is intentionally 
chosen over models to highlight the entire process in- 
volved, from goal definition, study design, and data 
collection to scientific use. 

1.1 Explanatory Modeling 

In many scientific fields, and especially the social 
sciences, statistical methods are used nearly exclu- 
sively for testing causal theory. Given a causal theo- 
retical model, statistical models are applied to data 
in order to test causal hypotheses. In such mod- 
els, a set of underlying factors that are measured 
by variables X are assumed to cause an underlying 
effect, measured by variable Y. Based on collabora- 
tive work with social scientists and economists, on 
an examination of some of their literature, and on 
conversations with a diverse group of researchers, I 
conjecture that, whether statisticians like it or not, 
the type of statistical models used for testing causal 
hypotheses in the social sciences are almost always 
association-based models applied to observational
data. Regression models are the most common
example. The justification for this practice is that the
theory itself provides the causality. In other words, 
the role of the theory is very strong and the reliance 
on data and statistical modeling is strictly through
the lens of the theoretical model. The theory-data 
relationship varies in different fields. While the so- 
cial sciences are very theory-heavy, in areas such as 
bioinformatics and natural language processing the 
emphasis on a causal theory is much weaker. Hence, 
given this reality, I define explaining as causal ex- 
planation and explanatory modeling as the use of 
statistical models for testing causal explanations. 

To illustrate how explanatory modeling is typi- 
cally done, I describe the structure of a typical arti- 
cle in a highly regarded journal in the field of Infor- 
mation Systems (IS). Researchers in the field of IS 
usually have training in economics and/or the behavioral
sciences. The structure of articles reflects
the way empirical research is conducted in IS and 
related fields. 

The example used is an article by Gefen, Kara- 
hanna and Straub (2003), which studies technology 
acceptance. The article starts with a presentation of 
the prevailing relevant theory(ies):

Online purchase intensions should be ex- 
plained in part by the technology accep- 
tance model (TAM). This theoretical model 
is at present a preeminent theory of tech- 
nology acceptance in IS. 

The authors then proceed to state multiple causal 
hypotheses (denoted H1, H2, ... in Figure 1, right
panel), justifying the merits for each hypothesis and 
grounding it in theory. The research hypotheses are 
given in terms of theoretical constructs rather than 
measurable variables.

H1: PU will positively affect intended use of a business-to-consumer (B2C) Web site.
H2: PEOU will positively affect intended use of a business-to-consumer (B2C) Web site.
H3: PEOU will positively affect PU of a business-to-consumer (B2C) Web site.

Fig. 1. Causal diagram (left) and partial list of stated hypotheses (right) from Gefen, Karahanna and Straub (2003).

Unlike measurable variables, constructs are abstractions that "describe a
phenomenon of theoretical interest" (Edwards and
Bagozzi, 2000) and can be observable or unobserv- 
able. Examples of constructs in this article are trust, 
perceived usefulness (PU), and perceived ease of use 
(PEOU). Examples of constructs used in other fields 
include anger, poverty, well-being, and odor. The hy- 
potheses section will often include a causal diagram 
illustrating the hypothesized causal relationship be- 
tween the constructs (see Figure 1, left panel). The 
next step is construct operationalization, where a 
bridge is built between theoretical constructs and 
observable measurements, using previous literature 
and theoretical justification. Only after the theoret- 
ical component is completed, and measurements are 
justified and defined, do researchers proceed to the 
next step where data and statistical modeling are in- 
troduced alongside the statistical hypotheses, which 
are operationalized from the research hypotheses. 
Statistical inference will lead to "statistical conclu- 
sions" in terms of effect sizes and statistical sig- 
nificance in relation to the causal hypotheses. Fi- 
nally, the statistical conclusions are converted into 
research conclusions, often accompanied by policy 
recommendations.

In summary, explanatory modeling refers here to 
the application of statistical models to data for test- 
ing causal hypotheses about theoretical constructs. 
Whereas "proper" statistical methodology for test- 
ing causality exists, such as designed experiments 
or specialized causal inference methods for observa- 
tional data [e.g., causal diagrams (Pearl, 1995), dis- 
covery algorithms (Spirtes, Glymour and Scheines, 
2000), probability trees (Shafer, 1996), and propen- 
sity scores (Rosenbaum and Rubin, 1983; Rubin, 
1997)], in practice association-based statistical mod- 
els, applied to observational data, are most com- 
monly used for that purpose. 

1.2 Predictive Modeling 

I define predictive modeling as the process of ap- 
plying a statistical model or data mining algorithm 
to data for the purpose of predicting new or future 
observations. In particular, I focus on nonstochastic 
prediction (Geisser, 1993, page 31), where the goal 
is to predict the output value (Y) for new observa- 
tions given their input values (X). This definition 
also includes temporal forecasting, where observa- 
tions until time t (the input) are used to forecast 
future values at time t + k, k > 0 (the output). Predictions
include point or interval predictions, prediction regions,
predictive distributions, or rankings of new observations.
A predictive model is any method
that produces predictions, regardless of its underly- 
ing approach: Bayesian or frequentist, parametric or 
nonparametric, data mining algorithm or statistical 
model, etc. 
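
As a minimal sketch of the temporal-forecasting case described above, the following Python snippet builds (input, output) pairs in which the observations up to time t form the input and the value at time t + k is the output. The function names `forecast_pairs`, `naive_forecast` and `mean_absolute_error` are hypothetical and introduced only for illustration; the last-value forecast is an assumed stand-in for any predictive method, not a method proposed in this article.

```python
def forecast_pairs(series, k):
    """Return (history, target) pairs: history = series[0..t], target = series[t + k]."""
    if k <= 0:
        raise ValueError("forecast horizon k must be positive")
    return [(series[: t + 1], series[t + k]) for t in range(len(series) - k)]


def naive_forecast(history):
    """Hypothetical stand-in predictor: forecast the last observed value."""
    return history[-1]


def mean_absolute_error(series, k):
    """Average absolute k-step-ahead error of the naive forecast over all pairs."""
    pairs = forecast_pairs(series, k)
    if not pairs:
        raise ValueError("series is too short for the requested horizon")
    errors = [abs(target - naive_forecast(history)) for history, target in pairs]
    return sum(errors) / len(errors)
```

Any other predictive method, statistical or algorithmic, could replace `naive_forecast` without changing the input/output structure.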

1.3 Descriptive Modeling 

Although not the focus of this article, a third 
type of modeling, which is the most commonly used 
and developed by statisticians, is descriptive mod- 
eling. This type of modeling is aimed at summariz- 
ing or representing the data structure in a compact 
manner. Unlike explanatory modeling, in descriptive 
modeling the reliance on an underlying causal the- 
ory is absent or incorporated in a less formal way. 
Also, the focus is at the measurable level rather than 
at the construct level. Unlike predictive modeling, 
descriptive modeling is not aimed at prediction. Fit- 
ting a regression model can be descriptive if it is 
used for capturing the association between the de- 
pendent and independent variables rather than for 
causal inference or for prediction. We mention this 
type of modeling to avoid confusion with causal- 
explanatory and predictive modeling, and also to 
highlight the different approaches of statisticians and 
nonstatisticians. 

1.4 The Scientific Value of Predictive Modeling 

Although explanatory modeling is commonly used 
for theory building and testing, predictive modeling 
is nearly absent in many scientific fields as a tool 
for developing theory. One possible reason is the 
statistical training of nonstatistician researchers. A 
look at many introductory statistics textbooks re- 
veals very little in the way of prediction. Another 
reason is that prediction is often considered unsci- 
entific. Berk (2008) wrote, "In the social sciences, 
for example, one either did causal modeling econo- 
metric style or largely gave up quantitative work." 
From conversations with colleagues in various disci- 
plines it appears that predictive modeling is often 
valued for its applied utility, yet is discarded for sci- 
entific purposes such as theory building or testing. 
Shmueli and Koppius (2010) illustrated the lack of 
predictive modeling in the field of IS. Searching the 
1072 papers published in the two top-rated journals 
Information Systems Research and MIS Quarterly 
between 1990 and 2006, they found only 52 empirical 
papers with predictive claims, of which only seven 
carried out proper predictive modeling or testing.

Even among academic statisticians, there appears
to be a divide between those who value prediction as 
the main purpose of statistical modeling and those 
who see it as unacademic. Examples of statisticians 
who emphasize predictive methodology include 
Akaike ("The predictive point of view is a proto- 
typical point of view to explain the basic activity of 
statistical analysis" in Findley and Parzen, 1998), 
Deming ("The only useful function of a statistician
is to make predictions" in Wallis, 1980), Geisser
("The prediction of observables or potential observables
is of much greater relevance than the estimate
of what are often artificial constructs-parameters," 
Geisser, 1975), Aitchison and Dunsmore ("predic- 
tion analysis. . . is surely at the heart of many statis- 
tical applications," Aitchison and Dunsmore, 1975) 
and Friedman ("One of the most common and im- 
portant uses for data is prediction," Friedman, 1997). 
Examples of those who see it as unacademic are 
Kendall and Stuart ("The Science of Statistics deals
with the properties of populations. In considering a 
population of men we are not interested, statistically 
speaking, in whether some particular individual has 
brown eyes or is a forger, but rather in how many 
of the individuals have brown eyes or are forgers," 
Kendall and Stuart, 1977) and more recently Parzen
("The two goals in analyzing data. . . I prefer to
describe as "management" and "science." Management
seeks profit. . . Science seeks truth," Parzen, 2001).
In economics there is a similar disagreement
regarding "whether prediction per se is a legitimate
objective of economic science, and also whether observed
data should be used only to shed light on existing
theories or also for the purpose of hypothesis
seeking in order to develop new theories" (Feelders, 2002).

Before proceeding with the discrimination between 
explanatory and predictive modeling, it is impor- 
tant to establish prediction as a necessary scientific 
endeavor beyond utility, for the purpose of devel- 
oping and testing theories. Predictive modeling and 
predictive testing serve several necessary scientific 
functions: 

1. Newly available large and rich datasets often con- 
tain complex relationships and patterns that are 
hard to hypothesize, especially given theories that 
exclude newly measurable concepts. Using pre- 
dictive modeling in such contexts can help un- 
cover potential new causal mechanisms and lead 
to the generation of new hypotheses. See, for example,
the discussion between Gurbaxani and Mendelson (1990, 1994)
and Collopy, Adya and Armstrong (1994).

2. The development of new theory often goes hand 
in hand with the development of new measures 
(Van Maanen, Sorensen and Mitchell, 2007). Pre- 
dictive modeling can be used to discover new 
measures as well as to compare different oper- 
ationalizations of constructs and different mea- 
surement instruments. 

3. By capturing underlying complex patterns and 
relationships, predictive modeling can suggest im- 
provements to existing explanatory models. 

4. Scientific development requires empirically rig- 
orous and relevant research. Predictive model- 
ing enables assessing the distance between theory 
and practice, thereby serving as a "reality check" 
to the relevance of theories.^1 While explanatory
power provides information about the strength 
of an underlying causal relationship, it does not 
imply its predictive power. 

5. Predictive power assessment offers a straightfor- 
ward way to compare competing theories by ex- 
amining the predictive power of their respective 
explanatory models. 

6. Predictive modeling plays an important role in 
quantifying the level of predictability of measur- 
able phenomena by creating benchmarks of pre- 
dictive accuracy (Ehrenberg and Bound, 1993). 
Knowledge of unpredictability is a fundamental
component of scientific knowledge (see, e.g.,
Taleb, 2007). Because predictive models tend to 
have higher predictive accuracy than explanatory 
statistical models, they can give an indication of 
the potential level of predictability. A very low 
predictability level can lead to the development 
of new measures, new collected data, and new 
empirical approaches. An explanatory model that 
is close to the predictive benchmark may suggest 
that our understanding of that phenomenon can 
only be increased marginally. On the other hand, 
an explanatory model that is very far from the 
predictive benchmark would imply that there are 
substantial practical and theoretical gains to be 
had from further scientific development. 

For a related, more detailed discussion of the value 
of prediction to scientific theory development see the 
work of Shmueli and Koppius (2010). 



^1 Predictive models are advantageous in terms of negative
empiricism: a model either predicts accurately or it does not, 
and this can be observed. In contrast, explanatory models can 
never be confirmed and are harder to contradict.

1.5 Explaining and Predicting Are Different

In the philosophy of science, it has long been de- 
bated whether explaining and predicting are one or 
distinct. The conflation of explanation and predic- 
tion has its roots in philosophy of science literature, 
particularly the influential hypothetico-deductive 
model (Hempel and Oppenheim, 1948), which ex- 
plicitly equated prediction and explanation. How- 
ever, as later became clear, the type of uncertainty 
associated with explanation is of a different nature 
than that associated with prediction (Helmer and 
Rescher, 1959). This difference highlighted the need 
for developing models geared specifically toward deal- 
ing with predicting future events and trends such as 
the Delphi method (Dalkey and Helmer, 1963). The 
distinction between the two concepts has been fur- 
ther elaborated (Forster and Sober, 1994; Forster, 
2002; Sober, 2002; Hitchcock and Sober, 2004; Dowe, 
Gardner and Oppy, 2007). In his book Theory Building,
Dubin (1969, page 9) wrote:

Theories of social and human behavior ad- 
dress themselves to two distinct goals of 
science: (1) prediction and (2) understand- 
ing. It will be argued that these are sep- 
arate goals [. . . ] I will not, however, con- 
clude that they are either inconsistent or 
incompatible. 

Herbert Simon distinguished between "basic science" 
and "applied science" (Simon, 2001), a distinction 
similar to explaining versus predicting. According 
to Simon, basic science is aimed at knowing ("to describe
the world") and understanding ("to provide
explanations of these phenomena"). In contrast, in 
applied science, "Laws connecting sets of variables 
allow inferences or predictions to be made from known 
values of some of the variables to unknown values of 
other variables." 

Why should there be a difference between explain- 
ing and predicting? The answer lies in the fact that 
measurable data are not accurate representations 
of their underlying constructs. The operationaliza- 
tion of theories and constructs into statistical mod- 
els and measurable data creates a disparity between 
the ability to explain phenomena at the conceptual 
level and the ability to generate predictions at the 
measurable level. 

To convey this disparity more formally, consider
a theory postulating that construct $\mathcal{X}$ causes construct
$\mathcal{Y}$ via the function $\mathcal{F}$, such that $\mathcal{Y} = \mathcal{F}(\mathcal{X})$.
$\mathcal{F}$ is often represented by a path model, a set of
qualitative statements, a plot (e.g., a supply and
demand plot), or mathematical formulas. Measurable
variables $X$ and $Y$ are operationalizations of
$\mathcal{X}$ and $\mathcal{Y}$, respectively. The operationalization of $\mathcal{F}$
into a statistical model $f$, such as $E(Y) = f(X)$, is
done by considering $\mathcal{F}$ in light of the study design
(e.g., numerical or categorical $Y$; hierarchical or flat
design; time series or cross-sectional; complete or
censored data) and practical considerations such as
standards in the discipline. Because $\mathcal{F}$ is usually not
sufficiently detailed to lead to a single $f$, often a set
of $f$ models is considered. Feelders (2002) described
this process in the field of economics. In the predictive
context, we consider only $X$, $Y$ and $f$.

The disparity arises because the goal in explanatory
modeling is to match $f$ and $\mathcal{F}$ as closely as
possible for the statistical inference to apply to the
theoretical hypotheses. The data $X, Y$ are tools for
estimating $f$, which in turn is used for testing the
causal hypotheses. In contrast, in predictive modeling
the entities of interest are $X$ and $Y$, and the
function $f$ is used as a tool for generating good
predictions of new $Y$ values. In fact, we will see that
even if the underlying causal relationship is indeed
$\mathcal{Y} = \mathcal{F}(\mathcal{X})$, a function other than $f(X)$ and data
other than $X$ might be preferable for prediction.

The disparity manifests itself in different ways. 
Four major aspects are: 

Causation-Association: In explanatory modeling $f$
represents an underlying causal function, and $X$
is assumed to cause $Y$. In predictive modeling $f$
captures the association between $X$ and $Y$.

Theory-Data: In explanatory modeling, $f$ is carefully
constructed based on $\mathcal{F}$ in a fashion that
supports interpreting the estimated relationship
between $X$ and $Y$ and testing the causal hypotheses.
In predictive modeling, $f$ is often constructed
from the data. Direct interpretability in terms of
the relationship between $X$ and $Y$ is not required,
although sometimes transparency of $f$ is desirable.

Retrospective-Prospective: Predictive modeling is
forward-looking, in that $f$ is constructed for predicting
new observations. In contrast, explanatory
modeling is retrospective, in that $f$ is used to test
an already existing set of hypotheses.

Bias-Variance: The expected prediction error for a
new observation with value $x$, using a quadratic
loss function,^2 is given by Hastie, Tibshirani and
Friedman (2009, page 223)

$$
\begin{aligned}
\mathrm{EPE} &= E\{Y - \hat{f}(x)\}^2 \\
&= E\{Y - f(x)\}^2 + \{E(\hat{f}(x)) - f(x)\}^2 + E\{\hat{f}(x) - E(\hat{f}(x))\}^2 \\
&= \mathrm{Var}(Y) + \mathrm{Bias}^2 + \mathrm{Var}(\hat{f}(x)).
\end{aligned}
$$

Bias is the result of misspecifying the statistical
model $f$. Estimation variance (the third term) is
the result of using a sample to estimate $f$. The
first term is the error that results even if the model 
is correctly specified and accurately estimated. The 
above decomposition reveals a source of the differ- 
ence between explanatory and predictive model- 
ing: In explanatory modeling the focus is on mini- 
mizing bias to obtain the most accurate represen- 
tation of the underlying theory. In contrast, pre- 
dictive modeling seeks to minimize the combina- 
tion of bias and estimation variance, occasionally 
sacrificing theoretical accuracy for improved em- 
pirical precision. This point is illustrated in the 
Appendix, showing that the "wrong" model can 
sometimes predict better than the correct one. 
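
The bias-variance trade-off described above can be illustrated with a small Monte Carlo sketch. The setup is an assumption made purely for illustration and is not the derivation given in the Appendix: data are generated as Y = slope * X + noise with a weak slope, the correctly specified model fits an intercept and slope by least squares, and the underspecified ("wrong") model ignores X and predicts the sample mean of Y. The function name `simulate_epe` and all default parameter values are hypothetical.

```python
import random


def simulate_epe(n=10, slope=0.1, sigma=1.0, reps=20000, seed=0):
    """Monte Carlo estimate of expected squared prediction error for two models.

    Assumed data-generating process: Y = slope * X + noise,
    with X ~ N(0, 1) and noise ~ N(0, sigma^2).
    Returns (epe_correct, epe_underspecified).
    """
    rng = random.Random(seed)
    total_correct = 0.0
    total_under = 0.0
    for _ in range(reps):
        # Draw a training sample of size n.
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ys = [slope * x + rng.gauss(0.0, sigma) for x in xs]
        x_bar = sum(xs) / n
        y_bar = sum(ys) / n
        sxx = sum((x - x_bar) ** 2 for x in xs)
        sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        # Correctly specified model: least-squares intercept and slope.
        b1 = sxy / sxx
        b0 = y_bar - b1 * x_bar
        # One new observation from the same process.
        x_new = rng.gauss(0.0, 1.0)
        y_new = slope * x_new + rng.gauss(0.0, sigma)
        total_correct += (y_new - (b0 + b1 * x_new)) ** 2
        # Underspecified model: ignores X, predicts the training mean of Y.
        total_under += (y_new - y_bar) ** 2
    return total_correct / reps, total_under / reps


if __name__ == "__main__":
    epe_correct, epe_under = simulate_epe()
    print(f"correct model EPE:        {epe_correct:.3f}")
    print(f"underspecified model EPE: {epe_under:.3f}")
```

Under these assumed settings (small sample, weak effect), the underspecified model accepts a small bias in exchange for a larger reduction in estimation variance, so its estimated error should come out lower. Raising `slope` or `n` should reverse the ordering, since the bias then outweighs the variance saved.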

The four aspects impact every step of the modeling
process, such that the resulting $f$ is markedly
different in the explanatory and predictive contexts, 
as will be shown in Section 2. 

1.6 A Void in the Statistics Literature 

The philosophical explaining/predicting debate has 
not been directly translated into statistical language 
in terms of the practical aspects of the entire statis- 
tical modeling process. 

A search of the statistics literature for discussion 
of explaining versus predicting reveals a lively dis- 
cussion in the context of model selection, and in par- 
ticular, the derivation and evaluation of model selec- 
tion criteria. In this context, Konishi and Kitagawa 
(2007) wrote: 

There may be no significant difference between
the point of view of inferring the
true structure and that of making a prediction
if an infinitely large quantity of
data is available or if the data are noiseless.
However, in modeling based on a finite
quantity of real data, there is a significant
gap between these two points of view,
because an optimal model for prediction
purposes may be different from one obtained
by estimating the 'true model.'

The literature on this topic is vast, and we do not intend
to cover it here, although we discuss the major
points in Section 2.6.

^2 For a binary Y, various 0-1 loss functions have been suggested
in place of the quadratic loss function (Domingos, 2000).

The focus on prediction in the field of machine 
learning and by statisticians such as Geisser, Aitchi- 
son and Dunsmore, Breiman and Friedman, has high- 
lighted aspects of predictive modeling that are rel- 
evant to the explanatory/prediction distinction, al- 
though they do not directly contrast explanatory 
and predictive modeling.^3 The prediction literature
raises the importance of evaluating predictive power 
using holdout data, and the usefulness of algorith- 
mic methods (Breiman, 2001b). The predictive fo- 
cus has also led to the development of inference 
tools that generate predictive distributions. Geisser 
(1993) introduced "predictive inference" and devel- 
oped it mainly in a Bayesian context. "Predictive 
likelihood" (see Bjornstad, 1990) is a likelihood-based 
approach to predictive inference, and Dawid's pre- 
quential theory (Dawid, 1984) investigates inference 
concepts in terms of predictability. Finally, the bias- 
variance aspect has been pivotal in data mining for 
understanding the predictive performance of differ- 
ent algorithms and for designing new ones. 
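
The role of holdout data mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not a procedure taken from the cited literature: a simple linear regression is fitted on a random training portion, in-sample R-squared is reported as a measure of fit, and predictive power is assessed separately by the root mean squared error (RMSE) on the holdout portion, alongside a benchmark that predicts the training mean. The names `fit_line`, `rmse` and `holdout_report`, and the 30% holdout fraction, are hypothetical.

```python
import math
import random


def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def holdout_report(xs, ys, holdout_fraction=0.3, seed=0):
    """Fit on a training split; report in-sample fit and holdout accuracy."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    n_hold = max(1, int(round(holdout_fraction * len(xs))))
    hold, train = idx[:n_hold], idx[n_hold:]
    x_tr, y_tr = [xs[i] for i in train], [ys[i] for i in train]
    x_ho, y_ho = [xs[i] for i in hold], [ys[i] for i in hold]

    intercept, slope = fit_line(x_tr, y_tr)
    y_bar = sum(y_tr) / len(y_tr)

    # In-sample R-squared: computed on the same data used for fitting.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(x_tr, y_tr))
    ss_tot = sum((y - y_bar) ** 2 for y in y_tr)

    # Holdout RMSE: computed on observations not used for fitting.
    preds = [intercept + slope * x for x in x_ho]
    return {
        "in_sample_r2": 1.0 - ss_res / ss_tot,
        "holdout_rmse": rmse(y_ho, preds),
        "benchmark_rmse": rmse(y_ho, [y_bar] * len(y_ho)),
    }
```

The two kinds of quantities answer different questions: the in-sample figure describes how well the fitted model accounts for the data it was estimated from, whereas the holdout figures indicate how well it predicts new observations relative to a simple benchmark.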

Another area in statistics and econometrics that 
focuses on prediction is time series. Methods have 
been developed specifically for testing the predictabil- 
ity of a series [e.g., random walk tests or the concept 
of Granger causality (Granger, 1969)], and evalu- 
ating predictability by examining performance on 
holdout data. The time series literature in statis- 
tics is dominated by extrapolation models such as 
ARIMA-type models and exponential smoothing meth- 
ods, which are suitable for prediction and descrip- 
tion, but not for causal explanation. Causal models 
for time series are common in econometrics (e.g., 
Song and Witt, 2000), where an underlying causal 
theory links constructs, which lead to operationalized
variables, as in the cross-sectional case. Yet, to
the best of my knowledge, there is no discussion in
the statistics time series literature regarding the distinction
between predictive and explanatory modeling,
aside from the debate in economics regarding
the scientific value of prediction.

^3 Geisser distinguished between "[statistical] parameters"
and "observables" in terms of the objects of interest. His distinction
is closely related, but somewhat different from our
distinction between theoretical constructs and measurements.

To conclude, the explanatory/predictive modeling
distinction has been discussed directly in the
model selection context, but not in the larger con- 
text. Areas that focus on developing predictive mod- 
eling such as machine learning and statistical time 
series, and "predictivists" such as Geisser, have considered
prediction as a separate issue, and have not
discussed its principal and practical distinction from 
causal explanation in terms of developing and test- 
ing theory. The goal of this article is therefore to 
examine the explanatory versus predictive debate 
from a statistical perspective, considering how mod- 
eling is used by nonstatistician scientists for theory 
development. 

The remainder of the article is organized as follows.
In Section 2, I consider each step in the modeling
process in terms of the four aspects of the predictive/explanatory
modeling distinction: causation-association,
theory-data, retrospective-prospective
and bias-variance. Section 3 illustrates some of these
differences via two examples. A discussion of the implications
of the predict/explain conflation, conclusions,
and recommendations are given in Section 4.

2. TWO MODELING PATHS 

In the following I examine the process of statisti- 
cal modeling through the explain/predict lens, from 
goal definition to model use and reporting. For clar- 
ity, I broke down the process into a generic set of 
steps, as depicted in Figure 2. In each step I point 
out differences in the choice of methods, criteria, 
data, and information to consider when the goal is 
predictive versus explanatory. I also briefly describe 
the related statistics literature. The conceptual and 
practical differences invariably lead to a difference 
between a final explanatory model and a predic- 
tive one, even though they may use the same initial 
data. Thus, a priori determination of the main study 
goal as either explanatory or predictive^4 is essential
to conducting adequate modeling. The discussion in 
this section assumes that the main research goal has 
been determined as either explanatory or predictive. 



^4 The main study goal can also be descriptive.



2.1 Study Design and Data Collection 

Even at the early stages of study design and data 
collection, issues of what and how much data to 
collect, according to what design, and which col- 
lection instrument to use are considered differently 
for prediction versus explanation. Consider sample 
size. In explanatory modeling, where the goal is to 
estimate the theory-based $f$ with adequate precision
and to use it for inference, statistical power is the 
main consideration. Reducing bias also requires suf- 
ficient data for model specification testing. Beyond 
a certain amount of data, however, extra precision 
is negligible for purposes of inference. In contrast, 
in predictive modeling, $f$ itself is often determined
from the data, thereby requiring a larger sample 
for achieving lower bias and variance. In addition, 
more data are needed for creating holdout datasets 
(see Section 2.2). Finally, predicting new individ- 
ual observations accurately, in a prospective man- 
ner, requires more data than retrospective inference 
regarding population-level parameters, due to the 
extra uncertainty. 
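
The "extra uncertainty" in predicting individual observations can be made concrete with a textbook comparison, sketched here under simplifying assumptions that are not part of the original argument: independent normal observations with known standard deviation `sigma`. The half-width of a confidence interval for the population mean shrinks toward zero as n grows, whereas the half-width of a prediction interval for a single new observation never falls below the level set by `sigma`. The function name `interval_half_widths` is hypothetical.

```python
import math
from statistics import NormalDist


def interval_half_widths(n, sigma=1.0, level=0.95):
    """Half-widths of (confidence interval for the mean, prediction interval).

    Assumes i.i.d. normal data with known standard deviation sigma.
    """
    if n < 1:
        raise ValueError("sample size n must be at least 1")
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    # Uncertainty about the population mean: vanishes as n grows.
    ci_half_width = z * sigma / math.sqrt(n)
    # Uncertainty about one new observation: estimation error plus noise.
    pi_half_width = z * sigma * math.sqrt(1.0 + 1.0 / n)
    return ci_half_width, pi_half_width


if __name__ == "__main__":
    for n in (10, 100, 10000):
        ci, pi = interval_half_widths(n)
        print(f"n={n:>6}: CI half-width={ci:.3f}, PI half-width={pi:.3f}")
```

The sketch only illustrates the additional noise term carried by individual-level prediction; it does not by itself establish how much more data a given predictive task requires.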

A second design issue is sampling scheme. For in- 
stance, in the context of hierarchical data (e.g., sam- 
pling students within schools) Afshartous and de 
Leeuw (2005) noted, "Although there exists an ex- 
tensive literature on estimation issues in multilevel 
models, the same cannot be said with respect to pre- 
diction." Examining issues of sample size, sample al- 
location, and multilevel modeling for the purpose of 
"predicting a future observable in the Jth group 
of a hierarchical dataset," they found that allocation
for estimation versus prediction should be different: 
"an increase in group size n is often more benefi- 
cial with respect to prediction than an increase in 
the number of groups J. . . [whereas] estimation is 
more improved by increasing the number of groups 
J instead of the group size n." This relates directly 
to the bias-variance aspect. A related issue is the 
choice of $f$ in relation to sampling scheme. Afshartous
and de Leeuw (2005) found that for their hierarchical
data, a hierarchical $f$, which is more appropriate
theoretically, had poorer predictive performance
than a nonhierarchical $f$.

A third design consideration is the choice between 
experimental and observational settings. Whereas 
for causal explanation experimental data are greatly 
preferred, subject to availability and resource con- 
straints, in prediction sometimes observational data 
are preferable to "overly clean" experimental data, if 
they better represent the realistic context of prediction
in terms of the uncontrolled factors, the noise,
the measured response, etc. This difference arises 
from the theory-data and prospective-retrospective 
aspects. Similarly, when choosing between primary 
data (data collected for the purpose of the study) 
and secondary data (data collected for other pur- 
poses), the classic criteria of data recency, relevance, 
and accuracy (Patzer, 1995) are considered from 
a different angle. For example, a predictive model 
requires the secondary data to include the exact 
X, Y variables to be used at the time of prediction, 
whereas for causal explanation different operationalizations
of the constructs 𝒳, 𝒴 may be acceptable.

In terms of the data collection instrument, whereas 
in explanatory modeling the goal is to obtain a re- 
liable and valid instrument such that the data ob- 
tained represent the underlying construct adequately 
(e.g., item response theory in psychometrics), for
predictive purposes it is more important to focus on 
the measurement quality and its meaning in terms 
of the variable to be predicted. 

Finally, consider the field of design of experiments: 
two major experimental designs are factorial designs 
and response surface methodology (RSM) designs. 
The former is focused on causal explanation in terms 
of finding the factors that affect the response. The 
latter is aimed at prediction: finding the combination
of predictors that optimizes Y. Factorial designs
employ a linear f for interpretability, whereas RSM
designs use optimization techniques and estimate a
nonlinear f from the data, which is less interpretable
but more predictively accurate.^

^I thank Douglas Montgomery for this insight.

2.2 Data Preparation

We consider two common data preparation operations:
handling missing values and data partitioning.

2.2.1 Handling missing values Most real datasets
contain missing values, thereby requiring one to
identify the missing values, to determine the extent
and type of missingness, and to choose a course of
action accordingly. Although a rich literature ex- 
ists on data imputation, it is monopolized by an 
explanatory context. In predictive modeling, the so- 
lution strongly depends on whether the missing val- 
ues are in the training data and/or the data to be 
predicted. For example, Sarle (1998) noted: 

If you have only a small proportion of 
cases with missing data, you can simply 
throw out those cases for purposes of es- 
timation; if you want to make predictions 
for cases with missing inputs, you don't 
have the option of throwing those cases 
out. 

Sarle further listed imputation methods that are 
useful for explanatory purposes but not for predic- 
tive purposes and vice versa. One example is us- 
ing regression models with dummy variables that 
indicate missingness, which is considered unsatisfac- 
tory in explanatory modeling, but can produce ex- 
cellent predictions. The usefulness of creating miss- 
ingness dummy variables was also shown by Ding 
and Simonoff (2010). In particular, whereas the clas- 
sic explanatory approach is based on the Missing- 
At-Random, Missing-Completely-At-Random or Not- 
Missing-At-Random classification (Little and Ru- 
bin, 2002), Ding and Simonoff (2010) showed that 
for predictive purposes the important distinction is 
whether the missingness depends on Y or not. They 
concluded: 

In the context of classification trees, the 
relationship between the missingness and 
the dependent variable, rather than the 
standard missingness classification approach 
of Little and Rubin (2002). . . is the most 
helpful criterion to distinguish different miss- 
ing data methods. 

Moreover, missingness can be a blessing in a pre- 
dictive context, if it is sufficiently informative of Y 
(e.g., missingness in financial statements when the 
goal is to predict fraudulent reporting). 
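
To illustrate how informative missingness can aid prediction, the following minimal Python sketch simulates a binary Y whose predictor is missing far more often when Y = 1. The missingness rates, sample sizes and function names are assumptions made for illustration only; they are not taken from Sarle (1998) or Ding and Simonoff (2010). A predicted probability based solely on a missingness dummy is compared, on a holdout sample, with a baseline that ignores missingness.

```python
import random

random.seed(0)

def simulate(n):
    """Return (y, is_missing) pairs; the predictor is missing more often when y == 1."""
    rows = []
    for _ in range(n):
        y = 1 if random.random() < 0.2 else 0
        p_missing = 0.7 if y == 1 else 0.1   # hypothetical missingness rates
        rows.append((y, random.random() < p_missing))
    return rows

train = simulate(5000)
holdout = simulate(5000)

# Baseline: ignore missingness and predict the training base rate for everyone.
base_rate = sum(y for y, _ in train) / len(train)

def rate(rows, flag):
    """Share of y == 1 among rows whose missingness dummy equals flag."""
    ys = [y for y, m in rows if m == flag]
    return sum(ys) / len(ys)

# Dummy-based prediction: P(Y = 1) depends only on whether the predictor is missing.
p_if_missing = rate(train, True)
p_if_observed = rate(train, False)

def brier(rows, predict):
    """Mean squared error of predicted probabilities (lower is better)."""
    return sum((predict(m) - y) ** 2 for y, m in rows) / len(rows)

brier_baseline = brier(holdout, lambda m: base_rate)
brier_indicator = brier(holdout, lambda m: p_if_missing if m else p_if_observed)
print(brier_baseline, brier_indicator)
```

Under these assumed rates the dummy-based prediction should achieve the lower holdout Brier score, even though the dummy says nothing about why the values are missing.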

Finally, a completely different approach for handling
missing data for prediction, mentioned by Sarle
(1998) and further developed by Saar-Tsechansky 
and Provost (2007), considers the case where to- 
be-predicted observations are missing some predic- 
tor information, such that the missing information 
can vary across different observations. The proposed 
solution is to estimate multiple "reduced" models, 
each excluding some predictors. When predicting an 
observation with missingness on a certain set of pre- 
dictors, the model that excludes those predictors is 
used. This approach means that different reduced 
models are created for different observations. Al- 
though useful for prediction, it is clearly inappro- 
priate for causal explanation. 
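
The reduced-models idea can be sketched as follows. This is a minimal illustration under assumptions of my own (two simulated predictors, linear models fitted by least squares, and the hypothetical helper names fit_ols and predict); it is not the implementation of Saar-Tsechansky and Provost (2007). One model is estimated for every subset of predictors, and each new observation is scored by the model that uses exactly the predictors it has.

```python
import random

def fit_ols(rows, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(xi[a] * xi[b] for xi in X) for b in range(k)]
         + [sum(xi[a] * yi for xi, yi in zip(X, y))]
         for a in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [u - factor * v for u, v in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

random.seed(1)
n = 500
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
y = [1.0 + 2.0 * a - 1.0 * b + random.gauss(0, 0.5) for a, b in zip(x1, x2)]

predictor_names = ("x1", "x2")
columns = {"x1": x1, "x2": x2}

# One reduced model per subset of predictors that may be available.
models = {}
for available in [("x1", "x2"), ("x1",), ("x2",), ()]:
    rows = [[columns[name][i] for name in available] for i in range(n)]
    models[available] = fit_ols(rows, y)

def predict(obs):
    """obs maps predictor name -> value; None marks a missing input."""
    available = tuple(name for name in predictor_names if obs.get(name) is not None)
    coefs = models[available]
    return coefs[0] + sum(c * obs[name] for c, name in zip(coefs[1:], available))

full_prediction = predict({"x1": 1.0, "x2": 0.5})      # uses the full model
reduced_prediction = predict({"x1": 1.0, "x2": None})  # uses the model without x2
print(full_prediction, reduced_prediction)
```

The number of reduced models grows quickly with the number of predictors, so in practice one would presumably fit only the missingness patterns that actually occur in the data to be predicted.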

2.2.2 Data partitioning A popular solution for 
avoiding overoptimistic predictive accuracy is to 
evaluate performance not on the training set, that 
is, the data used to build the model, but rather on a 
holdout sample which the model "did not see." The 
creation of a holdout sample can be achieved in var- 
ious ways, the most commonly used being a random 
partition of the sample into training and holdout 
sets. A popular alternative, especially with scarce 
data, is cross-validation. Other alternatives are re- 
sampling methods, such as bootstrap, which can 
be computationally intensive but avoid "bad par- 
titions" and enable predictive modeling with small 
datasets. 
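
The two most common partitioning schemes mentioned above can be sketched in a few lines of Python. The function names, the holdout fraction and the number of folds are illustrative assumptions; the sketch only produces index sets and leaves model fitting aside.

```python
import random

def holdout_split(n, holdout_fraction, seed):
    """Randomly partition the indices 0..n-1 into training and holdout sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_holdout = int(round(n * holdout_fraction))
    return idx[n_holdout:], idx[:n_holdout]

def k_fold_splits(n, k, seed):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield training, folds[i]

train_idx, holdout_idx = holdout_split(n=10, holdout_fraction=0.3, seed=0)
splits = list(k_fold_splits(n=10, k=5, seed=0))
print(train_idx, holdout_idx)
```

In cross-validation every observation serves in a validation fold exactly once, which is why the scheme is attractive when data are too scarce to set aside a single holdout sample.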

Data partitioning is aimed at minimizing the com- 
bined bias and variance by sacrificing some bias in 
return for a reduction in sampling variance. A smaller 
sample is associated with higher bias when f is estimated
from the data, which is common in predictive
modeling but not in explanatory modeling. Hence, 
data partitioning is useful for predictive modeling 
but less so for explanatory modeling. With today's 
abundance of large datasets, where the bias sacrifice 
is practically small, data partitioning has become a 
standard preprocessing step in predictive modeling. 

In explanatory modeling, data partitioning is less 
common because of the reduction in statistical power. 
When used, it is usually done for the retrospective 
purpose of assessing the robustness of f. A rarer
yet important use of data partitioning in explana- 
tory modeling is for strengthening model validity, 
by demonstrating some predictive power. Although 
one would not expect an explanatory model to be 
optimal in terms of predictive power, it should show 
some degree of accuracy (see discussion in Section 
4.2). 



2.3 Exploratory Data Analysis 

Exploratory data analysis (EDA) is a key initial 
step in both explanatory and predictive modeling. 
It consists of summarizing the data numerically and 
graphically, reducing their dimension, and "prepar- 
ing" for the more formal modeling step. Although 
the same set of tools can be used in both cases, 
they are used in a different fashion. In explanatory 
modeling, exploration is channeled toward the the- 
oretically specified causal relationships, whereas in 
predictive modeling EDA is used in a more free-form 
fashion, supporting the purpose of capturing rela- 
tionships that are perhaps unknown or at least less 
formally formulated. 

One example is how data visualization is carried 
out. Fayyad, Grinstein and Wierse (2002, page 22) 
contrasted "exploratory visualization" with "confirmatory
visualization":

Visualizations can be used to explore data, 
to confirm a hypothesis, or to manipu- 
late a viewer. . . In exploratory visualiza- 
tion the user does not necessarily know 
what he is looking for. This creates a dy- 
namic scenario in which interaction is crit- 
ical. . . In a confirmatory visualization, the 
user has a hypothesis that needs to be 
tested. This scenario is more stable and 
predictable. System parameters are often 
predetermined. 

Hence, interactivity, which supports exploration 
across a wide and sometimes unknown terrain, is 
very useful for learning about measurement quality 
and associations that are at the core of predictive 
modeling, but much less so in explanatory model- 
ing, where the data are visualized through the the- 
oretical lens. 

A second example is numerical summaries. In a 
predictive context, one might explore a wide range 
of numerical summaries for all variables of inter- 
est, whereas in an explanatory model, the numerical 
summaries would focus on the theoretical relation- 
ships. For example, in order to assess the role of a 
certain variable as a mediator, its correlation with 
the response variable and with other covariates is 
examined by generating specific correlation tables. 

A third example is the use of EDA for assess- 
ing assumptions of potential models (e.g., normality 
or multicollinearity) and exploring possible variable 
transformations. Here, too, an explanatory context 
would be more restrictive in terms of the space explored.

Finally, dimension reduction is viewed and used 
differently. In predictive modeling, a reduction in the 
number of predictors can help reduce sampling vari- 
ance. Hence, methods such as principal components 
analysis (PCA) or other data compression methods 
that are even less interpretable (e.g., singular value 
decomposition) are often carried out initially. They 
may later lead to the use of compressed variables 
(such as the first few components) as predictors, 
even if those are not easily interpretable. PCA is 
also used in explanatory modeling, but for a differ- 
ent purpose. For questionnaire data, PCA and ex- 
ploratory factor analysis are used to determine the 
validity of the survey instrument. The resulting fac- 
tors are expected to correspond to the underlying 
constructs. In fact, the rotation step in factor anal- 
ysis is specifically aimed at making the factors more 
interpretable. Similarly, correlations are used for as- 
sessing the reliability of the survey instrument. 

2.4 Choice of Variables 

The criteria for choosing variables differ markedly 
in explanatory versus predictive contexts. 

In explanatory modeling, where variables are seen 
as operationalized constructs, variable choice is based 
on the role of the construct in the theoretical causal 
structure and on the operationalization itself. A broad 
terminology related to different variable roles exists 
in various fields: in the social sciences — antecedent, 
consequent, mediator and moderator^ variables; in 
pharmacology and medical sciences — treatment and 
control variables; and in epidemiology — exposure and 
confounding variables. Carte and Craig (2003) men- 
tioned that explaining moderating effects has be- 
come an important scientific endeavor in the field of 
Management Information Systems. Another impor- 
tant term common in economics is endogeneity or 
"reverse causation," which results in biased param- 
eter estimates. Endogeneity can occur due to dif- 
ferent reasons. One reason is incorrectly omitting 
an input variable, say Z, from f when the causal
construct 𝒵 is assumed to cause 𝒳 and 𝒴. In a regression
model of Y on X, the omission of Z results
in X being correlated with the error term. Winkelmann
(2008) gave the example of a hypothesis that
health insurance (𝒳) affects the demand for health
services (𝒴). The operationalized variables are "health
insurance status" (X) and "number of doctor consultations"
(Y). Omitting an input measurement
Z for "true health status" (𝒵) from the regression
model f causes endogeneity because X can be determined
by Y (i.e., reverse causation), which manifests
as X being correlated with the error term in
f. Endogeneity can arise due to other reasons such
as measurement error in X. Because of the focus
in explanatory modeling on causality and on bias,
there is a vast literature on detecting endogeneity
and on solutions such as constructing instrumental
variables and using models such as two-stage-least-squares
(2SLS). Another related term is simultaneous
causality, which gives rise to special models
such as Seemingly Unrelated Regression (SUR)
(Zellner, 1962). In terms of chronology, a causal explanatory
model can include only "control" variables
that take place before the causal variable (Gelman
et al., 2003). And finally, for reasons of model
identifiability (i.e., given the statistical model, each
causal effect can be identified), one is required to
include main effects in a model that contains an interaction
term between those effects. We note this
practice because it is not necessary or useful in the
predictive context, due to the acceptability of uninterpretable
models and the potential reduction in
sampling variance when dropping predictors (see,
e.g., the Appendix).

^"A moderator variable is one that influences the
strength of a relationship between two other variables,
and a mediator variable is one that explains
the relationship between the two other variables" (from
http://psych.wisc.edu/henriques/mediator.html).
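
The omitted-variable source of endogeneity described above can be illustrated with a small simulation. The data-generating process below is entirely hypothetical (a true effect of 1.0 of X on Y, and an omitted Z that drives both), and the helper names are my own. The adjusted estimate uses the Frisch-Waugh-Lovell result that the coefficient of X in a regression of Y on X and Z equals the slope obtained after partialling Z out of both variables.

```python
import random

def mean(v):
    return sum(v) / len(v)

def slope(x, y):
    """Slope of the simple least-squares regression of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residuals(x, y):
    """Residuals from the simple regression of y on x (with intercept)."""
    b, mx, my = slope(x, y), mean(x), mean(y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]

random.seed(2)
n = 20000
z = [random.gauss(0, 1) for _ in range(n)]                 # omitted variable
x = [zi + random.gauss(0, 1) for zi in z]                  # Z causes X
y = [1.0 * xi + 2.0 * zi + random.gauss(0, 1)              # true effect of X is 1.0
     for xi, zi in zip(x, z)]

# Z omitted: X is correlated with the error term, so the slope is biased.
naive = slope(x, y)

# Z included: partial Z out of both X and Y, then regress residual on residual.
adjusted = slope(residuals(z, x), residuals(z, y))

print(naive, adjusted)
```

In this assumed setup the naive slope converges to 2.0 rather than 1.0, a bias that matters greatly for causal explanation; including Z recovers a value close to 1.0.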

In predictive modeling, the focus on association 
rather than causation, the lack of ℱ, and the prospective
context, mean that there is no need to delve into
the exact role of each variable in terms of an under- 
lying causal structure. Instead, criteria for choosing 
predictors are quality of the association between the 
predictors and the response, data quality, and avail- 
ability of the predictors at the time of prediction, 
known as ex-ante availability. In terms of ex-ante 
availability, whereas chronological precedence of X 
to Y is necessary in causal models, in predictive 
models not only must X precede Y, but X must be 
available at the time of prediction. For instance, ex- 
plaining wine quality retrospectively would dictate 
including barrel characteristics as a causal factor. 
The inclusion of barrel characteristics in a predic- 
tive model of future wine quality would be impossi- 
ble if at the time of prediction the grapes are still 
on the vine. See the eBay example in Section 3.2 for 
another example. 

2.5 Choice of Methods

Considering the four aspects of causation-associa- 
tion, theory-data, retrospective-prospective and bias- 
variance leads to different choices of plausible meth- 
ods, with a much larger array of methods useful 
for prediction. Explanatory modeling requires interpretable
statistical models f that are easily linked to
the underlying theoretical model ℱ. Hence the popularity
of statistical models, and especially regression-type
methods, in many disciplines. Algorithmic methods
such as neural networks or k-nearest-neighbors,
and uninterpretable nonparametric models, are con- 
sidered ill-suited for explanatory modeling. 

In predictive modeling, where the top priority is 
generating accurate predictions of new observations 
and f is often unknown, the range of plausible methods
includes not only statistical models (interpretable
and uninterpretable) but also data mining algorithms. 
A neural network algorithm might not shed light on 
an underlying causal mechanism ℱ or even on f,
but it can capture complicated associations, thereby 
leading to accurate predictions. Although model 
transparency might be important in some cases, it is 
of secondary importance: "Using complex predictors 
may be unpleasant, but the soundest path is to go 
for predictive accuracy first, then try to understand 
why" (Breiman, 2001b). 

Breiman (2001b) accused the statistical commu- 
nity of ignoring algorithmic modeling: 

There are two cultures in the use of statis- 
tical modeling to reach conclusions from 
data. One assumes that the data are gen- 
erated by a given stochastic data model. 
The other uses algorithmic models and 
treats the data mechanism as unknown. 
The statistical community has been com- 
mitted to the almost exclusive use of data 
models. 

From the explanatory/predictive view, algorithmic
modeling is indeed very suitable for predictive (and 
descriptive) modeling, but not for explanatory mod- 
eling. 

Some methods are not suitable for prediction 
from the retrospective-prospective aspect, especially 
in time series forecasting. Models with coincident 
indicators, which are measured simultaneously, are 
such a class. An example is the model Airfare_t =
f(OilPrice_t), which might be useful for explaining
the effect of oil price on airfare based on a causal
theory, but not for predicting future airfare because 
the oil price at the time of prediction is unknown. 
For prediction, alternative models must be consid- 
ered, such as using a lagged OilPrice variable, or cre- 
ating a separate model for forecasting oil prices and 
plugging its forecast into the airfare model. Another 
example is the centered moving average, which re- 
quires the availability of data during a time window 
before and after a period of interest, and is therefore 
not useful for prediction. 
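
The point about the centered moving average can be seen directly in a small Python sketch. The series values and function names are made up for illustration. A trailing moving average, which uses only current and past values, is shown alongside as one alternative that remains computable at the most recent period.

```python
def centered_ma(series, window):
    """Centered moving average (odd window); None where future or past data are lacking."""
    half = window // 2
    out = []
    for t in range(len(series)):
        if t - half < 0 or t + half >= len(series):
            out.append(None)
        else:
            out.append(sum(series[t - half:t + half + 1]) / window)
    return out

def trailing_ma(series, window):
    """Trailing moving average; uses only values up to and including period t."""
    out = []
    for t in range(len(series)):
        if t - window + 1 < 0:
            out.append(None)
        else:
            out.append(sum(series[t - window + 1:t + 1]) / window)
    return out

series = [10, 12, 11, 13, 15, 14]   # hypothetical observations, oldest first
centered = centered_ma(series, 3)
trailing = trailing_ma(series, 3)
print(centered)
print(trailing)
```

The centered average is undefined at the last period because it would need an observation that has not yet occurred, whereas the trailing average is available there and can feed a forecast.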

Lastly, the bias-variance aspect raises two classes 
of methods that are very useful for prediction, but 
not for explanation. The first is shrinkage methods 
such as ridge regression, principal components re- 
gression, and partial least squares regression, which 
"shrink" predictor coefficients or even eliminate them, 
thereby introducing bias into f, for the purpose of
reducing estimation variance. The second class of 
methods, which "have been called the most influ- 
ential development in Data Mining and Machine 
Learning in the past decade" (Seni and Elder, 2010, 
page vi), are ensemble methods such as bagging 
(Breiman, 1996), random forests (Breiman, 2001a), 
boosting^ (Schapire, 1999), variations of those methods,
and Bayesian alternatives (e.g., Brown, Vannucci
and Fearn, 2002). Ensembles combine multiple
models to produce more precise predictions by av- 
eraging predictions from different models, and have 
proven useful in numerous applications (see the Net- 
flix Prize example in Section 3.1). 
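
The bias-for-variance trade made by shrinkage methods can be illustrated with one-predictor ridge regression, where the penalized slope has the closed form S_xy / (S_xx + lambda). The simulation settings below (true slope, sample size, penalty, number of replications) are arbitrary assumptions chosen for illustration.

```python
import random

def ridge_slope(x, y, lam):
    """Slope of a one-predictor ridge regression with an unpenalized intercept."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / (sxx + lam)   # lam = 0 gives ordinary least squares

def mean(v):
    return sum(v) / len(v)

def variance(v):
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)

random.seed(3)
TRUE_SLOPE, N, LAM, REPS = 1.0, 10, 5.0, 2000

ols_estimates, ridge_estimates = [], []
for _ in range(REPS):
    # Draw a fresh small sample and estimate the slope both ways.
    x = [random.gauss(0, 1) for _ in range(N)]
    y = [TRUE_SLOPE * xi + random.gauss(0, 2) for xi in x]
    ols_estimates.append(ridge_slope(x, y, 0.0))
    ridge_estimates.append(ridge_slope(x, y, LAM))

print(mean(ols_estimates), variance(ols_estimates))
print(mean(ridge_estimates), variance(ridge_estimates))
```

Across replications the ridge estimates are pulled toward zero on average (bias) but vary less from sample to sample than the least-squares estimates (lower estimation variance).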

2.6 Validation, Model Evaluation and Model
Selection

Choosing the final model among a set of models, 
validating it, and evaluating its performance, differ 
markedly in explanatory and predictive modeling. 
Although the process is iterative, I separate it into 
three components for ease of exposition. 

2.6.1 Validation In explanatory modeling, validation
consists of two parts: model validation validates
that f adequately represents ℱ, and model fit validates
that f fits the data {X, Y}. In contrast, validation
in predictive modeling is focused on generalization,
which is the ability of f to predict new data
{X_new, Y_new}.

^Although boosting algorithms were developed as ensemble
methods, "[they can] be seen as an interesting regularization
scheme for estimating a model" (Bühlmann and Hothorn,
2007).

Methods used in explanatory modeling for model
validation include model specification tests such as 
the popular Hausman specification test in econo- 
metrics (Hausman, 1978), and construct validation 
techniques such as reliability and validity measures 
of survey questions and factor analysis. Inference 
for individual coefficients is also used for detect- 
ing over- or underspecification. Validating model fit 
involves goodness-of-fit tests (e.g., normality tests) 
and model diagnostics such as residual analysis. Al- 
though indications of lack of fit might lead researchers 
to modify f, modifications are made carefully in
light of the relationship with ℱ and the constructs
𝒳, 𝒴.

In predictive modeling, the biggest danger to gen- 
eralization is overfitting the training data. Hence 
validation consists of evaluating the degree of overfitting,
by comparing the performance of f on the
training and holdout sets. If performance is signifi- 
cantly better on the training set, overfitting is im- 
plied. 
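
A minimal sketch of this comparison uses a one-nearest-neighbor predictor, which reproduces its training data exactly and therefore overfits by construction. The simulated data and function names are assumptions for illustration only.

```python
import random

def one_nn_predict(train, x_new):
    """Predict with the response of the closest training point."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x_new))
    return nearest[1]

def mse(model_data, eval_data):
    """Mean squared error of the 1-NN predictor built on model_data."""
    errors = [(one_nn_predict(model_data, x) - y) ** 2 for x, y in eval_data]
    return sum(errors) / len(errors)

random.seed(4)

def draw(n):
    """Simulate (x, y) pairs from a noisy linear relationship."""
    out = []
    for _ in range(n):
        x = random.uniform(0, 10)
        out.append((x, 2.0 * x + random.gauss(0, 3)))
    return out

data = draw(400)
train, holdout = data[:300], data[300:]

train_mse = mse(train, train)      # each point is its own nearest neighbor
holdout_mse = mse(train, holdout)  # performance on data the model did not see
print(train_mse, holdout_mse)
```

The training error is zero while the holdout error is not, and this gap between the two is the signal of overfitting described above.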

Not only is the large context of validation markedly 
different in explanatory and predictive modeling, 
but so are the details. For example, checking for 
multicollinearity is a standard operation in assess- 
ing model fit. This practice is relevant in explana- 
tory modeling, where multicollinearity can lead to 
inflated standard errors, which interferes with infer- 
ence. Therefore, a vast literature exists on strategies 
for identifying and reducing multicollinearity, vari- 
able selection being one strategy. In contrast, for 
predictive purposes "multicollinearity is not quite as 
damning" (Vaughan and Berry, 2005). Makridakis, 
Wheelwright and Hyndman (1998, page 288) dis- 
tinguished between the role of multicollinearity in 
explaining versus its role in predicting: 

Multicollinearity is not a problem unless 
either (i) the individual regression coeffi- 
cients are of interest, or (ii) attempts are 
made to isolate the contribution of one 
explanatory variable to Y, without the in- 
fluence of the other explanatory variables. 
Multicollinearity will not affect the ability 
of the model to predict. 

Another example is the detection of influential ob- 
servations. While classic methods are aimed at de- 
tecting observations that are influential in terms of 
model estimation, Johnson and Geisser (1983) pro- 
posed a method for detecting influential observa- 
tions in terms of their effect on the predictive dis- 
tribution. 



2.6.2 Model evaluation Consider two performance 
aspects of a model: explanatory power and predictive
power. The top priority in terms of model performance
in explanatory modeling is assessing explanatory
power, which measures the strength of relationship
indicated by f. Researchers report R²-type
values and statistical significance of overall F-type
statistics to indicate the level of explanatory
power. 

In contrast, in predictive modeling, the focus is on 
predictive accuracy or predictive power, which refer 
to the performance of f on new data. Measures of
predictive power are typically out-of-sample metrics 
or their in-sample approximations, which depend on 
the type of required prediction. For example, predic- 
tions of a binary Y could be binary classifications 
(Y = 0, 1), predicted probabilities of a certain class 
[P(Y = 1)], or rankings of those probabilities. The 
latter are common in marketing and personnel psy- 
chology. These three different types of predictions 
would warrant different performance metrics. For 
example, a model can perform poorly in producing 
binary classifications but adequately in producing 
rankings. Moreover, in the context of asymmetric 
costs, where costs are heftier for some types of pre- 
diction errors than others, alternative performance 
metrics are used, such as the "average cost per pre- 
dicted observation." 
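
The following sketch shows how the same set of predicted probabilities can look poor as classifications yet adequate as a ranking, and how asymmetric costs change the picture again. The toy labels, probabilities, cutoffs and cost values are hypothetical, and the pairwise ranking measure used here (the share of positive-negative pairs ordered correctly, often called AUC) is my choice of ranking metric rather than one named in the text.

```python
def accuracy(y_true, probs, cutoff=0.5):
    """Share of correct binary classifications at the given cutoff."""
    preds = [1 if p >= cutoff else 0 for p in probs]
    return sum(int(a == b) for a, b in zip(y_true, preds)) / len(y_true)

def auc(y_true, probs):
    """Share of (positive, negative) pairs ranked correctly; ties count one half."""
    pos = [p for y, p in zip(y_true, probs) if y == 1]
    neg = [p for y, p in zip(y_true, probs) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def average_cost(y_true, probs, cutoff, cost_fp, cost_fn):
    """Average misclassification cost per predicted observation."""
    total = 0.0
    for y, p in zip(y_true, probs):
        pred = 1 if p >= cutoff else 0
        if pred == 1 and y == 0:
            total += cost_fp
        elif pred == 0 and y == 1:
            total += cost_fn
    return total / len(y_true)

y_true = [0, 0, 0, 1, 1, 1]
probs = [0.05, 0.10, 0.20, 0.25, 0.30, 0.40]   # well ordered, but all below 0.5

acc = accuracy(y_true, probs)                   # every case classified as 0
rank_quality = auc(y_true, probs)               # ordering is perfect
cost_default = average_cost(y_true, probs, 0.5, cost_fp=1.0, cost_fn=10.0)
cost_lowered = average_cost(y_true, probs, 0.22, cost_fp=1.0, cost_fn=10.0)
print(acc, rank_quality, cost_default, cost_lowered)
```

Which of these numbers matters depends on whether the required prediction is a class label, a ranking, or a decision with unequal error costs.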

A common misconception in various scientific fields 
is that predictive power can be inferred from ex- 
planatory power. However, the two are different and 
should be assessed separately. While predictive power 
can be assessed for both explanatory and predictive 
models, explanatory power is not typically possible 
to assess for predictive models because of the lack 
of ℱ and an underlying causal structure. Measures
such as R² and F would indicate the level of association,
but not causation.

Predictive power is assessed using metrics com- 
puted from a holdout set or using cross-validation 
(Stone, 1974; Geisser, 1975). Thus, a major differ- 
ence between explanatory and predictive performance 
metrics is the data from which they are computed. In 
general, measures computed from the data to which 
the model was fitted tend to be overoptimistic in 
terms of predictive accuracy: "Testing the proce- 
dure on the data that gave it birth is almost certain 
to overestimate performance" (Mosteller and Tukey, 
1977). Thus, the holdout set serves as a more real- 
istic context for evaluating predictive power. 

2.6.3 Model selection Once a set of models f_1, f_2,
. . . has been estimated and validated, model selec- 
tion pertains to choosing among them. Two main 
differentiating aspects are the data-theory and bias- 
variance considerations. In explanatory modeling, 
the models are compared in terms of explanatory 
power, and hence the popularity of nested models, 
which are easily compared. Stepwise-type methods, 
which use overall F statistics to include and/or ex- 
clude variables, might appear suitable for achiev- 
ing high explanatory power. However, optimizing 
explanatory power in this fashion conceptually con- 
tradicts the validation step, where variable inclu- 
sion/exclusion and the structure of the statistical 
model are carefully designed to represent the theo- 
retical model. Hence, proper explanatory model se- 
lection is performed in a constrained manner. In the 
words of Jaccard (2001): 

Trimming potentially theoretically mean- 
ingful variables is not advisable unless one 
is quite certain that the coefficient for the 
variable is near zero, that the variable is 
inconsequential, and that trimming will 
not introduce misspecification error. 

A researcher might choose to retain a causal co- 
variate which has a strong theoretical justification 
even if it is statistically insignificant. For example, in
medical research, a covariate that denotes whether 
a person smokes or not is often present in models 
for health conditions, whether it is statistically significant
or not.^ In contrast to explanatory power,
statistical significance plays a minor or no role in 
assessing predictive performance. In fact, it is some- 
times the case that removing inputs with small coef- 
ficients, even if they are statistically significant, re- 
sults in improved prediction accuracy (Greenberg 
and Parks, 1997; Wu, Harris and McAuley, 2007, 
and see the Appendix). Stepwise-type algorithms 
are very useful in predictive modeling as long as 
the selection criteria rely on predictive power rather 
than explanatory power. 

^I thank Ayala Cohen for this example.

As mentioned in Section 1.6, the statistics literature
on model selection includes a rich discussion
on the difference between finding the "true" model
and finding the best predictive model, and on criteria
for explanatory model selection versus predictive
model selection. A popular predictive metric is
the in-sample Akaike Information Criterion (AIC).
Akaike derived the AIC from a predictive viewpoint,
where the model is not intended to accurately infer 
the "true distribution," but rather to predict future 
data as accurately as possible (see, e.g., Berk, 2008;
Konishi and Kitagawa, 2007). Some researchers dis- 
tinguish between AIC and the Bayesian information 
criterion (BIC) on this ground. Sober (2002) con- 
cluded that AIC measures predictive accuracy while 
BIC measures goodness of fit: 

In a sense, the AIC and the BIC provide 
estimates of different things; yet, they al- 
most always are thought to be in compe- 
tition. If the question of which estimator 
is better is to make sense, we must decide 
whether the average likelihood of a family 
[=BIC] or its predictive accuracy [=AIC] 
is what we want to estimate. 
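
For reference, the usual textbook forms of the two criteria are given below; they are not stated in the text and the notation is an assumption: L(\hat{\theta}) is the maximized likelihood of the model, k the number of estimated parameters, and n the sample size.

```latex
\mathrm{AIC} = -2 \log L(\hat{\theta}) + 2k,
\qquad
\mathrm{BIC} = -2 \log L(\hat{\theta}) + k \log n .
```

The two criteria share the same fit term and differ only in the complexity penalty. Since \log n > 2 once n > e^2 \approx 7.4, BIC penalizes additional parameters more heavily than AIC in all but the smallest samples, which is consistent with the two criteria estimating different things.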

Similarly, Dowe, Gardner and Oppy (2007) con- 
trasted the two Bayesian model selection criteria 
Minimum Message Length (MML) and Minimum 
Expected Kullback-Leibler Distance (MEKLD). 
They concluded:

If you want to maximise predictive accu- 
racy, you should minimise the expected 
KL distance (MEKLD); if you want the 
best inference, you should use MML. 

Kadane and Lazar (2004) examined a variety of model 
selection criteria from a Bayesian decision-theoretic 
point of view, comparing prediction with explana- 
tion goals. 

Even when using predictive metrics, the fashion in 
which they are used within a model selection process 
can deteriorate their adequacy, yielding overopti- 
mistic predictive performance. Berk (2008) described 
the case where 

statistical learning procedures are often 
applied several times to the data with one 
or more tuning parameters varied. The 
AIC may be computed for each. But each 
AIC is ignorant about the information ob- 
tained from prior fitting attempts and how 
many degrees of freedom were expended 
in the process. Matters are even more complicated
if some of the variables are transformed
or recoded. . . Some unjustified optimism
remains.

2.7 Model Use and Reporting

Given all the differences that arise in the mod- 
eling process, the resulting predictive model would 
obviously be very different from a resulting explanatory
model in terms of the data used ({X, Y}), the
estimated model f, and explanatory power and predictive
power. The use of f would also greatly differ.

As illustrated in Section 1.1, explanatory models 
in the context of scientific research are used to de- 
rive "statistical conclusions" using inference, which 
in turn are translated into scientific conclusions regarding
ℱ, 𝒳, 𝒴 and the causal hypotheses. With
a focus on theory, causality, bias and retrospective 
analysis, explanatory studies are aimed at testing or 
comparing existing causal theories. Accordingly the 
statistical section of explanatory scientific papers is 
dominated by statistical inference. 

In predictive modeling f is used to generate predictions
for new data. We note that generating predictions
from f can range in the level of difficulty,
depending on the complexity of f and on the type
of prediction generated. For example, generating a 
complete predictive distribution is easier using a 
Bayesian approach than the predictive likelihood ap- 
proach. 

In practical applications, the predictions might be 
the final goal. However, the focus here is on pre- 
dictive modeling for supporting scientific research, 
as was discussed in Section 1.2. Scientific predictive 
studies and articles therefore emphasize data, asso- 
ciation, bias-variance considerations, and prospec- 
tive aspects of the study. Conclusions pertain to 
theory-building aspects such as new hypothesis gen- 
eration, practical relevance, and predictability level. 
Whereas explanatory articles focus on theoretical 
constructs and unobservable parameters and their 
statistical section is dominated by inference, predic- 
tive articles concentrate on the observable level, with 
predictive power and its comparison across models 
being the core. 

3. TWO EXAMPLES 

Two examples are used to broadly illustrate the 
differences that arise in predictive and explanatory 
studies. In the first I consider a predictive goal and 
discuss what would be involved in "converting" it 
to an explanatory study. In the second example I 
consider an explanatory study and what would be 
different in a predictive context. See the work of 
Shmueli and Koppius (2010) for a detailed example
"converting" the explanatory study of Gefen, Karahanna
and Straub (2003) from Section 1 into a predictive
one.

3.1 Netflix Prize 

Netflix is the largest online DVD rental service 
in the United States. In an effort to improve their 
movie recommendation system, in 2006 Netflix announced
a contest (http://netflixprize.com), making
public a huge dataset of user movie ratings. Each
observation consisted of a user ID, a movie title, and 
the rating that the user gave this movie. The task 
was to accurately predict the ratings of movie-user 
pairs for a test set such that the predictive accuracy
improved upon Netflix's recommendation engine
by at least 10%. The grand prize was set at
$1,000,000. The 2009 winner was a composite of
three teams, one of them from the AT&T research 
lab (see Bell, Koren and Volinsky, 2010). In their 
2008 report, the AT&T team, who also won the 2007 
and 2008 progress prizes, described their modeling 
approach (Bell, Koren and Volinsky, 2008). 

Let me point out several operations and choices 
described by Bell, Koren and Volinsky (2008) that 
highlight the distinctive predictive context. Start- 
ing with sample size, the very large sample released 
by Netflix was aimed at allowing the estimation of 
$f$ from the data, reflecting the absence of a strong
theory. In the data preparation step, with relation 
to missingness that is predictively informative, the 
team found that "the information on which movies 
each user chose to rate, regardless of specific rat- 
ing value" turned out to be useful. At the data ex- 
ploration and reduction step, many teams including 
the winners found that the noninterpretable Sin- 
gular Value Decomposition (SVD) data reduction 
method was key in producing accurate predictions: 
"It seems that models based on matrix-factorization 
were found to be most accurate." As for choice of 
variables, supplementing the Netflix data with information
about the movie (such as actors, director)
actually decreased accuracy: "We should mention 
that not all data features were found to be useful. 
For example, we tried to benefit from an extensive 
set of attributes describing each of the movies in 
the dataset. Those attributes certainly carry a sig- 
nificant signal and can explain some of the user be- 
havior. However, we concluded that they could not 
help at all for improving the accuracy of well tuned 
collaborative filtering models." In terms of choice of
methods, their solution was an ensemble of methods
that included nearest-neighbor algorithms, regression
models, and shrinkage methods. In particular,
they found that "using increasingly complex
models is only one way of improving accuracy. An 
apparently easier way to achieve better accuracy is 
by blending multiple simpler models." And indeed, 
more accurate predictions were achieved by collab- 
orations between competing teams who combined 
predictions from their individual models, such as the 
winners' combined team. All these choices and dis- 
coveries are very relevant for prediction, but not for 
causal explanation. Although the Netflix contest is 
not aimed at scientific advancement, there is clearly 
scientific value in the predictive models developed. 
They tell us about the level of predictability of online
user ratings of movies, and the implied usefulness
of the rating scale employed by Netflix. The
research also highlights the importance of knowing 
which movies a user does not rate. And importantly, 
it sets the stage for explanatory research. 
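
The blending remark quoted above can be illustrated with a small simulation. The following is a minimal sketch and not the AT&T team's method: it assumes two hypothetical models whose prediction errors are independent Gaussian noise of equal variance, in which case a simple average halves the error variance. The function names, the rating range and the noise level are illustrative assumptions.

```python
import math
import random


def rmse(predictions, truth):
    """Root mean squared error between two equal-length sequences."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, truth)]
    return math.sqrt(sum(squared) / len(squared))


def simulate_blend(n=5000, noise_sd=1.0, seed=0):
    """Compare two noisy (hypothetical) models with their simple average.

    Assumption: each model's error is independent Gaussian noise around
    the true rating, so blending averages part of the noise away.
    """
    rng = random.Random(seed)
    truth = [rng.uniform(1, 5) for _ in range(n)]          # synthetic ratings
    model_a = [t + rng.gauss(0, noise_sd) for t in truth]  # first model
    model_b = [t + rng.gauss(0, noise_sd) for t in truth]  # second model
    blend = [(a + b) / 2 for a, b in zip(model_a, model_b)]
    return {
        "model_a": rmse(model_a, truth),
        "model_b": rmse(model_b, truth),
        "blend": rmse(blend, truth),
    }


if __name__ == "__main__":
    for name, value in simulate_blend().items():
        print(f"{name}: RMSE = {value:.3f}")
```

When the errors of the component models are positively correlated, as is likely for models trained on the same data, the gain from blending is smaller than in this idealized setting.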

Let us consider a hypothetical goal of explaining
movie preferences. After stating causal hypotheses,
we would define constructs that link user behavior
and movie features $\mathcal{X}$ to user preference $\mathcal{Y}$,
with a careful choice of $\mathcal{F}$. An operationalization
step would link the constructs to measurable data, 
and the role of each variable in the causality struc- 
ture would be defined. Even if using the Netflix 
dataset, supplemental covariates that capture movie 
features and user characteristics would be absolutely 
necessary. In other words, the data collected and 
the variables included in the model would be differ- 
ent from the predictive context. As to methods and 
models, data compression methods such as SVD, 
heuristic-based predictive algorithms which learn $f$
from the data, and the combination of multiple models
would be considered inappropriate, as they lack
interpretability with respect to $\mathcal{F}$ and the hypotheses.
The choice of $f$ would be restricted to statistical
models that can be used for inference, and would 
directly model issues such as the dependence be- 
tween records for the same customer and for the 
same movie. Finally, the model would be validated 
and evaluated in terms of its explanatory power, and 
used to draw conclusions about the strength of the causal
relationship between various user and movie characteristics
and movie preferences. Hence, the explanatory
context leads to a completely different modeling
path and final result than the predictive context. 



It is interesting to note that most competing teams 
had a background in computer science rather than 
statistics. Yet, the winning team combines the two 
disciplines. Statisticians who see the uniqueness and 
importance of predictive modeling alongside explana- 
tory modeling have the capability of contributing to 
scientific advancement as well as achieving meaning- 
ful practical results (and large monetary awards). 

3.2 Online Auction Research 

The following example highlights the differences 
between explanatory and predictive research in on- 
line auctions. The predictive approach also illus- 
trates the utility in creating new theory in an area 
dominated by explanatory modeling. 

Online auctions have become a major player in 
providing electronic commerce services. eBay (www. 
eBay.com), the largest consumer-to-consumer auc- 
tion website, enables a global community of buy- 
ers and sellers to easily interact and trade. Empir- 
ical research of online auctions has grown dramat- 
ically in recent years. Studies using publicly avail- 
able bid data from websites such as eBay have found 
many divergences of bidding behavior and auction 
outcomes compared to ordinary offline auctions and 
classical auction theory. For instance, according to 
classical auction theory (e.g., Krishna, 2002), the 
final price of an auction is determined by a priori
information about the number of bidders, their valuation,
and the auction format. However, final price
determination in online auctions is quite different. 
Online auctions differ from offline auctions in vari- 
ous ways such as longer duration, anonymity of bid- 
ders and sellers, and low barriers of entry. These and 
other factors lead to new bidding behaviors that are 
not explained by auction theory. Another important 
difference is that the total number of bidders in most 
online auctions is unknown until the auction closes. 

Empirical research in online auctions has concentrated
in the fields of economics, information systems
and marketing. Explanatory modeling has been
employed to learn about different aspects of bidder 
behavior in auctions. A survey of empirical explana- 
tory research on auctions was given by Bajari and 
Hortacsu (2004). A typical explanatory study relies 
on game theory to construct $\mathcal{F}$, which can be done
in different ways. One approach is to construct a 
"structural model," which is a mathematical model 
linking the various constructs. The major construct
is "bidder valuation," which is the amount a bidder
is willing to pay, and is typically operationalized using
his observed placed bids. The structural model
and operationalized constructs are then translated
into a regression-type model [see, e.g., Sections 5
and 6 in Bajari and Hortacsu (2003)]. To illustrate 
the use of a statistical model in explanatory auc- 
tion research, consider the study by Lucking-Reiley 
et al. (2007) who used a dataset of 461 eBay coin 
auctions to determine the factors affecting the final 
auction price. They estimated a set of linear regression
models where Y = log(Price) and X included
auction characteristics (the opening bid, the auc- 
tion duration, and whether a secret reserve price 
was used), seller characteristics (the number of pos- 
itive and negative ratings), and a control variable 
(book value of the coin). One of their four reported 
models was of the form 

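$$
\log(\mathit{Price}) = \beta_0 + \beta_1 \log(\mathit{BookValue}) + \beta_2 \log(\mathit{MinBid}) + \beta_3 \mathit{Reserve} + \beta_4 \mathit{NumDays} + \beta_5 \mathit{PosRating} + \beta_6 \mathit{NegRating} + \varepsilon.
$$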

The other three models, or "model specifications," 
included a modified set of predictors, with some in- 
teraction terms and an alternate auction duration 
measurement. The authors used a censored-Normal 
regression for model estimation, because some auc- 
tions did not receive any bids and therefore the price 
was truncated at the minimum bid. Typical explana- 
tory aspects of the modeling are: 

Choice of variables: Several issues arise from the 
causal-theoretical context. First is the exclusion 
of the number of bidders (or bids) as a determi- 
nant due to endogeneity considerations, where al- 
though it is likely to affect the final price, "it is 
endogenously determined by the bidders' choices." 
To verify endogeneity the authors report fitting 
a separate regression of Y = Number of bids on
all the determinants. Second, the authors discuss 
operationalization challenges that might result in 
bias due to omitted variables. In particular, the 
authors discuss the construct of "auction attractiveness"
($\mathcal{X}$) and their inability to judge measures
such as photos and verbal descriptions to
operationalize attractiveness. 

Model validation: The four model specifications are 
used for testing the robustness of the hypothesized 
effect of the construct "auction length" across dif- 
ferent operationalized variables such as the contin- 
uous number of days and a categorical alternative. 



Model evaluation: For each model, its in-sample $R^2$
is used for determining explanatory power.

Model selection: The authors report the four fitted 
regression models, including both significant and 
insignificant coefficients. Retaining the insignificant
covariates in the model is for matching $f$
with $\mathcal{F}$.

Model use and reporting: The main focus is on inference
for the $\beta$'s, and the final conclusions are
given in causal terms. ("A seller's feedback rat- 
ings. . . have a measurable effect on her auction 
prices. . . when a seller chooses to have her auction 
last for a longer period of days [sic], this signifi- 
cantly increases the auction price on average.") 

Although online auction research is dominated by 
explanatory studies, there have been a few predic- 
tive studies developing forecasting models for an 
auction's final price (e.g., Jank, Shmueli and Wang, 
2008; Jap and Naik, 2008; Ghani and Simmons, 2004; 
Wang, Jank and Shmueli, 2008; Zhang, Jank and 
Shmueli, 2010). For a brief survey of online auc- 
tion forecasting research see the work of Jank and 
Shmueli (2010, Chapter 5). From my involvement in 
several of these predictive studies, let me highlight 
the purely predictive aspects that appear in this lit- 
erature: 

Choice of variables: If prediction takes place before 
or at the start of the auction, then obviously the 
total number of bids or bidders cannot be included 
as a predictor. While this variable was also omit- 
ted in the explanatory study, the omission was due 
to a different reason, that is, endogeneity. How- 
ever, if prediction takes place at time t during an 
ongoing auction, then the number of bidders/bids 
present at time t is available and useful for predict- 
ing the final price. Even more useful is the time 
series of the number of bidders from the start of 
the auction until time t as well as the price curve 
until time t (Bapna, Jank and Shmueli, 2008). 

Choice of methods: Predictive studies in online auctions
tend to learn $f$ from the data, using flexible
models and algorithmic methods (e.g., CART,
k-nearest neighbors, neural networks, functional
methods and related nonparametric smoothing-based
methods, Kalman filters and boosting; see,
e.g., Chapter 5 in Jank and Shmueli, 2010). Many
of these are not interpretable, yet have proven to 
provide high predictive accuracy.

Model evaluation: Auction forecasting studies evaluate
predictive power on holdout data. They report
performance in terms of out-of-sample metrics
such as MAPE and RMSE, and compare it
against that of other predictive models and benchmarks.
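
The holdout evaluation described in the last item can be sketched as follows. This is a minimal illustration on synthetic data, not a reconstruction of any of the cited studies: the variable names `current_price` and `final_price`, the simulated price relationship, the 70/30 split and the training-mean benchmark are all assumptions made for the example.

```python
import math
import random


def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(actual, predicted):
    """Root mean squared error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mape(actual, predicted):
    """Mean absolute percentage error (actual values must be nonzero)."""
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100 * sum(ratios) / len(ratios)


def holdout_evaluation(n=400, train_fraction=0.7, seed=1):
    """Fit on a training set, then score model and benchmark on a holdout."""
    rng = random.Random(seed)
    # Synthetic auctions: hypothetical price at time t and final price,
    # where the final price is never below the current price.
    current_price = [rng.uniform(10, 100) for _ in range(n)]
    final_price = [c * rng.uniform(1.0, 1.5) for c in current_price]

    n_train = round(n * train_fraction)
    x_train, y_train = current_price[:n_train], final_price[:n_train]
    x_test, y_test = current_price[n_train:], final_price[n_train:]

    intercept, slope = fit_line(x_train, y_train)
    model_pred = [intercept + slope * xi for xi in x_test]
    # Naive benchmark: always predict the training-set mean final price.
    benchmark_pred = [sum(y_train) / n_train] * len(y_test)

    return {
        "model": {"RMSE": rmse(y_test, model_pred),
                  "MAPE": mape(y_test, model_pred)},
        "benchmark": {"RMSE": rmse(y_test, benchmark_pred),
                      "MAPE": mape(y_test, benchmark_pred)},
    }


if __name__ == "__main__":
    for name, metrics in holdout_evaluation().items():
        print(name, {k: round(v, 2) for k, v in metrics.items()})
```

The key design point is that both the model and the benchmark are fitted on the training portion only, so the reported metrics measure performance on observations that played no role in estimation.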

Predictive models for auction price cannot provide 
direct causal explanations. However, by producing 
high-accuracy price predictions they shed light on 
new potential variables that are related to price and 
on the types of relationships that can be further 
investigated in terms of causality. For instance, a 
construct that is not directly measurable but that 
some predictive models are apparently capturing is 
competition between bidders. 

4. IMPLICATIONS, CONCLUSIONS AND 
SUGGESTIONS 

4.1 The Cost of Indiscrimination to 
Scientific Research 

Currently, in many fields, statistical modeling is 
used nearly exclusively for causal explanation. The 
consequence of neglecting to include predictive mod- 
eling and testing alongside explanatory modeling is 
losing the ability to test the relevance of existing 
theories and to discover new causal mechanisms. 
Feelders (2002) commented on the field of economics: 
"The pure hypothesis testing framework of economic 
data analysis should be put aside to give more scope 
to learning from the data. This closes the empirical 
cycle from observation to theory to the testing of 
theories on new data." The current accelerated rate 
of social, environmental, and technological changes 
creates a burning need for new theories and for the 
examination of old theories in light of the new real- 
ities. 

A common practice due to the indiscrimination 
of explanation and prediction is to erroneously in- 
fer predictive power from explanatory power, which 
can lead to incorrect scientific and practical conclu- 
sions. Colleagues from various fields confirmed this 
fact, and a cursory search of their scientific litera- 
ture brings up many examples. For instance, in ecol- 
ogy an article intending to predict forest beetle as- 
semblages infers predictive power from explanatory 
power ["To study. . . predictive power, ... we calculated
the $R^2$"; "We expect predictabilities with $R^2$
of up to 0.6" (Muller and Brandl, 2009)]. In economics,
an article entitled "The predictive power
of zero intelligence in financial markets" (Farmer,
Patelli and Zovko, 2005) infers predictive power from
a high $R^2$ value of a linear regression model. In epidemiology,
many studies rely on in-sample hazard
ratios estimated from Cox regression models to infer 
predictive power, reflecting an indiscrimination be- 
tween description and prediction. For instance, Nabi 
et al. (2010) used hazard ratio estimates and statis- 
tical significance "to compare the predictive power 
of depression for coronary heart disease with that of 
cerebrovascular disease." In information systems, an 
article on "Understanding and predicting electronic 
commerce adoption" (Pavlou and Fygenson, 2006) 
incorrectly compared the predictive power of differ- 
ent models using in-sample measures ("To examine 
the predictive power of the proposed model, we compare
it to four models in terms of $R^2$ adjusted").
These examples are not singular, but rather they 
reflect the common misunderstanding of predictive 
power in these and other fields. 
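
The gap between in-sample fit and predictive power can be made concrete with a short simulation. The sketch below is illustrative only and is not taken from any of the cited articles: it assumes synthetic data with one relevant predictor and thirty irrelevant ones, fits ordinary least squares on a small training sample, and computes $R^2$ both in-sample and on new observations. All function names and parameter values are assumptions.

```python
import random


def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z


def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


def r_squared(rows, y, beta, baseline):
    """1 - SSE/SST, with SST measured around a fixed baseline prediction."""
    pred = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    sst = sum((yi - baseline) ** 2 for yi in y)
    return 1 - sse / sst


def make_data(n, n_noise, rng):
    """One relevant predictor plus n_noise irrelevant ones (all synthetic)."""
    rows, y = [], []
    for _ in range(n):
        signal = rng.gauss(0, 1)
        rows.append([1.0, signal] + [rng.gauss(0, 1) for _ in range(n_noise)])
        y.append(signal + rng.gauss(0, 1))
    return rows, y


def compare_r2(n_train=60, n_test=1000, n_noise=30, seed=2):
    """Return (in-sample R^2, out-of-sample R^2) for the same fitted model."""
    rng = random.Random(seed)
    train_rows, train_y = make_data(n_train, n_noise, rng)
    test_rows, test_y = make_data(n_test, n_noise, rng)
    beta = ols(train_rows, train_y)
    baseline = sum(train_y) / n_train  # training mean, known before testing
    return (r_squared(train_rows, train_y, beta, baseline),
            r_squared(test_rows, test_y, beta, baseline))


if __name__ == "__main__":
    r2_in, r2_out = compare_r2()
    print(f"in-sample R^2 = {r2_in:.2f}, out-of-sample R^2 = {r2_out:.2f}")
```

Here the out-of-sample $R^2$ uses the training-set mean as its baseline, which is one common convention; it can be negative when the fitted model predicts new data worse than that baseline.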

Finally, a consequence of omitting predictive mod- 
eling from scientific research is also a gap between 
research and practice. In an age where empirical re- 
search has become feasible in many fields, the op- 
portunity to bridge the gap between methodological 
development and practical application can be easier 
to achieve through the combination of explanatory 
and predictive modeling. 

Finance is an example where practice is concerned 
with prediction whereas academic research is focused 
on explaining. In particular, there has been a re- 
liance on a limited number of models that are con- 
sidered pillars of research, yet have proven to per- 
form very poorly in practice. For instance, the CAPM 
model and more recently the Fama-French model
are regression models that have been used for ex- 
plaining market behavior for the purpose of portfolio 
management, and have been evaluated in terms of 
explanatory power (in-sample $R^2$ and residual analysis)
and not predictive accuracy. More recently,
researchers have begun recognizing the distinction
between in-sample explanatory power and out-of-sample
predictive power (Goyal and Welch, 2007),
which has led to a discussion of predictability magnitude
and a search for predictively accurate explanatory
variables (Campbell and Thompson, 2005). In
terms of predictive modeling, the Chief Actuary of
the Financial Supervisory Authority of Sweden commented
in 1999: "there is a need for models with
predictive power for at least a very near future. . .
Given sufficient and relevant data this is an area for
statistical analysis, including cluster analysis and
various kind of structure-finding methods" (Palmgren,
1999). While there has been some predictive
modeling using genetic algorithms (Chen, 2002) and
neural networks (Chakraborty and Sharma, 2007),
it has been performed by practitioners and nonfinance
academic researchers and outside of the top
academic journals.

Footnote: Although in their paper Fama and French (1993) did split
the sample into two parts, they did so for purposes of testing
the sensitivity of model estimates rather than for assessing
predictive accuracy.

In summary, the omission of predictive modeling 
for theory development results not only in academic 
work becoming irrelevant to practice, but also in 
creating a barrier to achieving significant scientific 
progress, which is especially unfortunate as data be- 
come easier to collect, store and access. 

In the opposite direction, in fields that focus on 
predictive modeling, the reason for omitting explana- 
tory modeling must be sought. A scientific field is 
usually defined by a cohesive body of theoretical 
knowledge, which can be tested. Hence, some form 
of testing, whether empirical or not, must be a com- 
ponent of the field. In areas such as bioinformat- 
ics, where there is little theory and an abundance 
of data, predictive models are pivotal in generating 
avenues for causal theory. 

4.2 Explanatory and Predictive Power: 
Two Dimensions 

I have polarized explaining and predicting in this 
article in an effort to highlight their fundamental 
differences. However, rather than considering them 
as extremes on some continuum, I consider them 
as two dimensions. Explanatory power and predictive
accuracy are different qualities; a model will
possess some level of each.

Footnote: Similarly, descriptive models can be considered as a third
dimension, where yet different criteria are used for assessing
the strength of the descriptive model.

Footnote: I thank Bill Langford for the two-dimensional insight.

A related controversial question arises: must an
explanatory model have some level of predictive power
to be considered scientifically useful? And equally,
must a predictive model have sufficient explanatory
power to be scientifically useful? For instance, some
explanatory models that cannot be tested for predictive
accuracy yet constitute scientific advances
are Darwinian evolution theory and string theory
in physics. The latter produces currently untestable
predictions (Woit, 2006, pages x-xii). Conversely,
there exist predictive models that do not properly 
"explain" yet are scientifically valuable. Galileo, in 
his book Two New Sciences, proposed a demonstra- 
tion to determine whether light was instantaneous. 
According to Mackay and Oldford (2000), Descartes 
gave the book a scathing review: 

The substantive criticisms are generally 
directed at Galileo's not having identified 
the causes of the phenomena he investi- 
gated. For most scientists at this time, 
and particularly for Descartes, that is the 
whole point of science. 

Similarly, consider predictive models that are based 
on a wrong explanation yet are considered scientifically
and practically valuable. One well-known
example is Ptolemaic astronomy, which until recently 
was used for nautical navigation but is based on a 
theory proven to be wrong long ago. While such ex- 
amples are extreme, in most cases models are likely 
to possess some level of both explanatory and pre- 
dictive power. 

Considering predictive accuracy and explanatory 
power as two axes on a two-dimensional plot would 
place different models ($f$), aimed either at explanation
or at prediction, on different areas of the
plot. The bi-dimensional approach implies that: (1) 
In terms of modeling, the goal of a scientific study 
must be specified a priori in order to optimize the 
criterion of interest; and (2) In terms of model eval- 
uation and scientific reporting, researchers should 
report both the explanatory and predictive qualities 
of their models. Even if prediction is not the goal, 
the predictive qualities of a model should be re- 
ported alongside its explanatory power so that it 
can be fairly evaluated in terms of its capabilities 
and compared to other models. Similarly, a predic- 
tive model might not require causal explanation in 
order to be scientifically useful; however, reporting 
its relation to causal theory is important for pur- 
poses of theory building. The availability of infor- 
mation on a variety of predictive and explanatory 
models along these two axes can shed light on both 
predictive and causal aspects of scientific phenom- 
ena. The statistical modeling process, as depicted 
in Figure 2, should include "overall model perfor- 
mance" in terms of both predictive and explanatory 
qualities.

4.3 The Cost of Indiscrimination to the
Field of Statistics 

Dissolving the ambiguity surrounding explanatory 
versus predictive modeling is important for advanc- 
ing our field itself. Recognizing that statistical 
methodology has focused mainly on inference indi- 
cates an important gap to be filled. While our lit- 
erature contains predictive methodology for model 
selection and predictive inference, there is scarce sta- 
tistical predictive methodology for other modeling 
steps, such as study design, data collection, data 
preparation and EDA, which present opportunities 
for new research. Currently, the predictive void has 
been taken up by the field of machine learning and data
mining. In fact, the differences, and some would say 
rivalry, between the fields of statistics and data min- 
ing can be attributed to their different goals of ex- 
plaining versus predicting even more than to factors 
such as data size. While statistical theory has focused
on model estimation, inference, and fit, machine
learning and data mining have concentrated
on developing computationally efficient predictive
algorithms and tackling the bias-variance trade-off
in order to achieve high predictive accuracy.

Footnote: I thank Foster Provost from NYU for this observation.

Sharpening the distinction between explanatory
and predictive modeling can raise a new awareness
of the strengths and limitations of existing methods
and practices, and might shed light on current
controversies within our field. One example is the
disagreement in survey methodology regarding the
use of sampling weights in the analysis of survey
data (Little, 2007). Whereas some researchers advocate
using weights to reduce bias at the expense of
increased variance, and others disagree, might not
the answer be related to the final goal?

Another ambiguity that can benefit from an explanatory/predictive
distinction is the definition of
parsimony. Some claim that predictive models should
be simpler than explanatory models: "Simplicity is
relevant because complex families often do a bad job
of predicting new data, though they can be made
to fit the old data quite well" (Sober, 2002). The
same argument was given by Hastie, Tibshirani and
Friedman (2009): "Typically the more complex we
make the model, the lower the bias but the higher
the variance." In contrast, some predictive models
in practice are very complex, and indeed Breiman
(2001b) commented: "in some cases predictive models
are more complex in order to capture small nuances
that improve predictive accuracy." Zellner
(2001) used the term "sophisticatedly simple" to de- 
fine the quality of a "good" model. I would suggest 
that the definitions of parsimony and complexity are 
task-dependent: predictive or explanatory. For ex- 
ample, an "overly complicated" model in explana- 
tory terms might prove "sophisticatedly simple" for 
predictive purposes. 

4.4 Closing Remarks and Suggestions 

The consequences from the explanatory/predictive 
distinction lead to two proposed actions: 

1. It is our responsibility to be aware of how statistical
models are used in research outside of statistics,
why they are used in that fashion, and, in
response, to develop methods that support sound
scientific research. Such knowledge can be gained
within our field by inviting scientists from different
disciplines to give talks at statistics conferences
and seminars, and by requiring graduate students
in statistics to read and present research
papers from other disciplines.

2. As a discipline, we must acknowledge the differ- 
ence between explanatory, predictive and descrip- 
tive modeling, and integrate it into statistics ed- 
ucation of statisticians and nonstatisticians, as 
early as possible but most importantly in "re- 
search methods" courses. This requires creating 
written materials that are easily accessible and 
understandable by nonstatisticians. We should 
advocate both explanatory and predictive mod- 
eling, clarify their differences and distinctive sci- 
entific and practical uses, and disseminate tools 
and knowledge for implementing both. One par- 
ticular aspect to consider is advocating a more 
careful use of terms such as "predictors," "pre- 
dictions" and "predictive power," to reduce the 
effects of terminology on incorrect scientific con- 
clusions. 

Awareness of the distinction between explanatory 
and predictive modeling, and of the different scientific
functions that each serves, is essential for the
progress of scientific knowledge. 

APPENDIX: IS THE "TRUE" MODEL THE 
BEST PREDICTIVE MODEL? A LINEAR 
REGRESSION EXAMPLE 

Consider $\mathcal{F}$ to be the true function relating constructs
$\mathcal{X}$ and $\mathcal{Y}$ and let us assume that $f$ is a valid
operationalization of $\mathcal{F}$. Choosing an intentionally
biased function $f^*$ in place of $f$ is clearly undesirable
from a theoretical-explanatory point of view.
However, we will show that $f^*$ can be preferable to
$f$ from a predictive standpoint.

To illustrate this, consider the statistical model
$f(x) = \beta_1 x_1 + \beta_2 x_2 + \varepsilon$, which is assumed to be correctly
specified with respect to $\mathcal{F}$. Using data, we
obtain the estimated model $\hat{f}$, which has the properties

$$\mathrm{Bias} = 0, \tag{2}$$

$$\mathrm{Var}(\hat{f}(x)) = \mathrm{Var}(x_1\hat{\beta}_1 + x_2\hat{\beta}_2) = \sigma^2 x'(X'X)^{-1}x, \tag{3}$$

where $x$ is the vector $x = [x_1, x_2]'$, and $X$ is the design
matrix based on both predictors. Combining
the squared bias with the variance gives

$$\mathrm{EPE} = E(Y - \hat{f}(x))^2 = \sigma^2 + 0 + \sigma^2 x'(X'X)^{-1}x = \sigma^2(1 + x'(X'X)^{-1}x). \tag{4}$$

In comparison, consider the estimated underspecified
form $\hat{f}^*(x) = \hat{\gamma}_1 x_1$. The bias and variance here
are given by Montgomery, Peck and Vining (2001,
pages 292-296):

$$\mathrm{Bias} = x_1\gamma_1 - (x_1\beta_1 + x_2\beta_2) = x_1(x_1'x_1)^{-1}x_1'(x_1\beta_1 + x_2\beta_2) - (x_1\beta_1 + x_2\beta_2),$$

$$\mathrm{Var}(\hat{f}^*(x)) = x_1\,\mathrm{Var}(\hat{\gamma}_1)\,x_1' = \sigma^2 x_1(x_1'x_1)^{-1}x_1'.$$

Combining the squared bias with the variance gives

$$\mathrm{EPE} = (x_1(x_1'x_1)^{-1}x_1'x_2\beta_2 - x_2\beta_2)^2 + \sigma^2(1 + x_1(x_1'x_1)^{-1}x_1'). \tag{5}$$

Although the bias of the underspecified model $\hat{f}^*(x)$
is larger than that of $\hat{f}(x)$, its variance can be smaller,
and in some cases so small that the overall EPE will 
be lower for the underspecified model. Wu, Harris 
and McAuley (2007) showed the general result for an 
underspecified linear regression model with multiple 
predictors. In particular, they showed that the un- 
derspecified model that leaves out q predictors has 
a lower EPE when the following inequality holds: 

$$q\sigma^2 > \beta_2' X_2' (I - H_1) X_2 \beta_2. \tag{6}$$

This means that the underspecified model produces 
more accurate predictions, in terms of lower EPE, 
in the following situations: 



• when the data are very noisy (large $\sigma$);

• when the true absolute values of the left-out parameters
(in our example $\beta_2$) are small;

• when the predictors are highly correlated; and 

• when the sample size is small or the range of left- 
out variables is small. 
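
Inequality (6) can be motivated by the following sketch. It rests on simplifying assumptions that may differ from the derivation of Wu, Harris and McAuley (2007): EPE is summed over the $n$ observed design points, the full model has $p$ predictors with full-rank design matrix $X = [X_1, X_2]$, and $H = X(X'X)^{-1}X'$ and $H_1 = X_1(X_1'X_1)^{-1}X_1'$ denote the corresponding hat matrices ($H$ is notation introduced here for illustration).

$$\sum_{i=1}^{n} \mathrm{EPE}_i(\hat{f}) = n\sigma^2 + \sigma^2\,\mathrm{tr}(H) = (n + p)\sigma^2,$$

$$\sum_{i=1}^{n} \mathrm{EPE}_i(\hat{f}^*) = n\sigma^2 + \sigma^2\,\mathrm{tr}(H_1) + \|(I - H_1)X_2\beta_2\|^2 = (n + p - q)\sigma^2 + \beta_2' X_2'(I - H_1)X_2\beta_2.$$

The squared-bias term uses $(I - H_1)X_1\beta_1 = 0$ together with the fact that $I - H_1$ is symmetric and idempotent. Subtracting the second line from the first shows that the underspecified model has the lower total EPE exactly when $q\sigma^2 > \beta_2' X_2'(I - H_1)X_2\beta_2$, which is (6).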

The bottom line is nicely summarized by Hagerty 
and Srinivasan (1991): "We note that the practice 
in applied research of concluding that a model with 
a higher predictive validity is "truer," is not a valid 
inference. This paper shows that a parsimonious but 
less true model can have a higher predictive validity 
than a truer but less parsimonious model." 
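
The appendix result can also be checked numerically. The following is a minimal Monte Carlo sketch under assumed parameter values (noisy data, a small left-out coefficient, highly correlated predictors and a small sample), chosen so that the conditions listed above hold. The function name and all defaults are illustrative assumptions, and the predictors are drawn at random rather than held fixed.

```python
import math
import random


def simulate_epe(n=20, beta1=1.0, beta2=0.2, sigma=2.0, rho=0.9,
                 replications=2000, seed=3):
    """Monte Carlo estimate of EPE for the full and underspecified models.

    True model: y = beta1*x1 + beta2*x2 + noise (no intercept, as in the
    appendix). All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)

    def draw(count):
        """Draw correlated predictors and responses from the true model."""
        data = []
        for _ in range(count):
            x1 = rng.gauss(0, 1)
            x2 = rho * x1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
            y = beta1 * x1 + beta2 * x2 + rng.gauss(0, sigma)
            data.append((x1, x2, y))
        return data

    sq_err_full = sq_err_under = 0.0
    n_eval = 0
    for _ in range(replications):
        train = draw(n)
        s11 = sum(x1 * x1 for x1, _, _ in train)
        s22 = sum(x2 * x2 for _, x2, _ in train)
        s12 = sum(x1 * x2 for x1, x2, _ in train)
        t1 = sum(x1 * y for x1, _, y in train)
        t2 = sum(x2 * y for _, x2, y in train)

        # Full model: solve the 2x2 normal equations in closed form.
        det = s11 * s22 - s12 ** 2
        b1 = (s22 * t1 - s12 * t2) / det
        b2 = (s11 * t2 - s12 * t1) / det
        # Underspecified model: regress y on x1 only.
        g1 = t1 / s11

        for x1, x2, y in draw(n):  # fresh observations for evaluation
            sq_err_full += (y - (b1 * x1 + b2 * x2)) ** 2
            sq_err_under += (y - g1 * x1) ** 2
            n_eval += 1
    return sq_err_full / n_eval, sq_err_under / n_eval


if __name__ == "__main__":
    epe_full, epe_under = simulate_epe()
    print(f"EPE full model:           {epe_full:.3f}")
    print(f"EPE underspecified model: {epe_under:.3f}")
```

Raising `beta2` or `n`, or lowering `sigma` or `rho`, moves the setting away from the conditions listed above and should eventually favor the correctly specified model, in line with inequality (6).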